unfluff

<!DOCTYPE html> <html class=" "> <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# object: http://ogp.me/ns/object# article: http://ogp.me/ns/article# profile: http://ogp.me/ns/profile#"> <meta charset='utf-8'> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <title>ageitgey/node-unfluff · GitHub</title> <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="GitHub" /> <link rel="fluid-icon" href="https://github.com/fluidicon.png" title="GitHub" /> <link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-114.png" /> <link rel="apple-touch-icon" sizes="114x114" href="/apple-touch-icon-114.png" /> <link rel="apple-touch-icon" sizes="72x72" href="/apple-touch-icon-144.png" /> <link rel="apple-touch-icon" sizes="144x144" href="/apple-touch-icon-144.png" /> <meta property="fb:app_id" content="1401488693436528"/> <meta content="@github" name="twitter:site" /><meta content="summary" name="twitter:card" /><meta content="ageitgey/node-unfluff" name="twitter:title" /><meta content="node-unfluff - Automatically extract body content (and other cool stuff) from an html document" name="twitter:description" /><meta content="https://avatars1.githubusercontent.com/u/896692?s=400" name="twitter:image:src" /> <meta content="GitHub" property="og:site_name" /><meta content="object" property="og:type" /><meta content="https://avatars1.githubusercontent.com/u/896692?s=400" property="og:image" /><meta content="ageitgey/node-unfluff" property="og:title" /><meta content="https://github.com/ageitgey/node-unfluff" property="og:url" /><meta content="node-unfluff - Automatically extract body content (and other cool stuff) from an html document" property="og:description" /> <link rel="assets" href="https://assets-cdn.github.com/"> <link rel="conduit-xhr" href="https://ghconduit.com:25035"> <meta name="msapplication-TileImage" content="/windows-tile.png" /> <meta name="msapplication-TileColor" content="#ffffff" /> <meta 
name="selected-link" value="repo_source" data-pjax-transient /> <meta name="google-analytics" content="UA-3769691-2"> <meta content="collector.githubapp.com" name="octolytics-host" /><meta content="collector-cdn.github.com" name="octolytics-script-host" /><meta content="github" name="octolytics-app-id" /><meta content="B8B6BE92:18CF:6D539F9:53C718B4" name="octolytics-dimension-request_id" /> <link rel="icon" type="image/x-icon" href="https://assets-cdn.github.com/favicon.ico" /> <meta content="authenticity_token" name="csrf-param" /> <meta content="blkX4Qfc+sw2H51eD1Qx8GuudX4aWpsuHvBXzx4/Sfcy/0uPSOTOmQwfVYQPi6o5aQYLqeKNTRguTGBpiLsDtg==" name="csrf-token" /> <link href="https://assets-cdn.github.com/assets/github-c534ad575b5bb8c3cc3dce9c571df7aa7400dbe9.css" media="all" rel="stylesheet" type="text/css" /> <link href="https://assets-cdn.github.com/assets/github2-84a1b6179d461213455892ab983182bc2052a7b5.css" media="all" rel="stylesheet" type="text/css" /> <meta http-equiv="x-pjax-version" content="ef2e8ad48b4c98b3a1a0065370258ac2"> <meta name="description" content="node-unfluff - Automatically extract body content (and other cool stuff) from an html document" /> <meta content="896692" name="octolytics-dimension-user_id" /><meta content="ageitgey" name="octolytics-dimension-user_login" /><meta content="21369952" name="octolytics-dimension-repository_id" /><meta content="ageitgey/node-unfluff" name="octolytics-dimension-repository_nwo" /><meta content="true" name="octolytics-dimension-repository_public" /><meta content="false" name="octolytics-dimension-repository_is_fork" /><meta content="21369952" name="octolytics-dimension-repository_network_root_id" /><meta content="ageitgey/node-unfluff" name="octolytics-dimension-repository_network_root_nwo" /> <link href="https://github.com/ageitgey/node-unfluff/commits/master.atom" rel="alternate" title="Recent Commits to node-unfluff:master" type="application/atom+xml" /> </head> <body class="logged_out env-production 
vis-public"> <a href="#start-of-content" tabindex="1" class="accessibility-aid js-skip-to-content">Skip to content</a> <div class="wrapper"> <div class="header header-logged-out"> <div class="container clearfix"> <a class="header-logo-wordmark" href="https://github.com/"> <span class="mega-octicon octicon-logo-github"></span> </a> <div class="header-actions"> <a class="button primary" href="/join">Sign up</a> <a class="button signin" href="/login?return_to=%2Fageitgey%2Fnode-unfluff">Sign in</a> </div> <div class="command-bar js-command-bar in-repository"> <ul class="top-nav"> <li class="explore"><a href="/explore">Explore</a></li> <li class="features"><a href="/features">Features</a></li> <li class="enterprise"><a href="https://enterprise.github.com/">Enterprise</a></li> <li class="blog"><a href="/blog">Blog</a></li> </ul> <form accept-charset="UTF-8" action="/search" class="command-bar-form" id="top_search_form" method="get"> <div class="commandbar"> <span class="message"></span> <input type="text" data-hotkey="s, /" name="q" id="js-command-bar-field" placeholder="Search or type a command" tabindex="1" autocapitalize="off" data-repo="ageitgey/node-unfluff" > <div class="display hidden"></div> </div> <input type="hidden" name="nwo" value="ageitgey/node-unfluff" /> <div class="select-menu js-menu-container js-select-menu search-context-select-menu"> <span class="minibutton select-menu-button js-menu-target" role="button" aria-haspopup="true"> <span class="js-select-button">This repository</span> </span> <div class="select-menu-modal-holder js-menu-content js-navigation-container" aria-hidden="true"> <div class="select-menu-modal"> <div class="select-menu-item js-navigation-item js-this-repository-navigation-item selected"> <span class="select-menu-item-icon octicon octicon-check"></span> <input type="radio" class="js-search-this-repository" name="search_target" value="repository" checked="checked" /> <div class="select-menu-item-text js-select-button-text">This 
repository</div> </div> <!-- /.select-menu-item --> <div class="select-menu-item js-navigation-item js-all-repositories-navigation-item"> <span class="select-menu-item-icon octicon octicon-check"></span> <input type="radio" name="search_target" value="global" /> <div class="select-menu-item-text js-select-button-text">All repositories</div> </div> <!-- /.select-menu-item --> </div> </div> </div> <span class="help tooltipped tooltipped-s" aria-label="Show command bar help"> <span class="octicon octicon-question"></span> </span> <input type="hidden" name="ref" value="cmdform"> </form> </div> </div> </div> <div id="start-of-content" class="accessibility-aid"></div> <div class="site" itemscope itemtype="http://schema.org/WebPage"> <div id="js-flash-container"> </div> <div class="pagehead repohead instapaper_ignore readability-menu"> <div class="container"> <ul class="pagehead-actions"> <li> <a href="/login?return_to=%2Fageitgey%2Fnode-unfluff" class="minibutton with-count star-button tooltipped tooltipped-n" aria-label="You must be signed in to star a repository" rel="nofollow"> <span class="octicon octicon-star"></span> Star </a> <a class="social-count js-social-count" href="/ageitgey/node-unfluff/stargazers"> 668 </a> </li> <li> <a href="/login?return_to=%2Fageitgey%2Fnode-unfluff" class="minibutton with-count js-toggler-target fork-button tooltipped tooltipped-n" aria-label="You must be signed in to fork a repository" rel="nofollow"> <span class="octicon octicon-repo-forked"></span> Fork </a> <a href="/ageitgey/node-unfluff/network" class="social-count"> 23 </a> </li> </ul> <h1 itemscope itemtype="http://data-vocabulary.org/Breadcrumb" class="entry-title public"> <span class="repo-label"><span>public</span></span> <span class="mega-octicon octicon-repo"></span> <span class="author"><a href="/ageitgey" class="url fn" itemprop="url" rel="author"><span itemprop="title">ageitgey</span></a></span><!-- --><span class="path-divider">/</span><!-- --><strong><a 
href="/ageitgey/node-unfluff" class="js-current-repository js-repo-home-link">node-unfluff</a></strong> <span class="page-context-loader"> <img alt="" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </span> </h1> </div><!-- /.container --> </div><!-- /.repohead --> <div class="container"> <div class="repository-with-sidebar repo-container new-discussion-timeline js-new-discussion-timeline with-full-navigation "> <div class="repository-sidebar clearfix"> <div class="sunken-menu vertical-right repo-nav js-repo-nav js-repository-container-pjax js-octicon-loaders"> <div class="sunken-menu-contents"> <ul class="sunken-menu-group"> <li class="tooltipped tooltipped-w" aria-label="Code"> <a href="/ageitgey/node-unfluff" aria-label="Code" class="selected js-selected-navigation-item sunken-menu-item" data-hotkey="g c" data-pjax="true" data-selected-links="repo_source repo_downloads repo_commits repo_releases repo_tags repo_branches /ageitgey/node-unfluff"> <span class="octicon octicon-code"></span> <span class="full-word">Code</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> <li class="tooltipped tooltipped-w" aria-label="Issues"> <a href="/ageitgey/node-unfluff/issues" aria-label="Issues" class="js-selected-navigation-item sunken-menu-item js-disable-pjax" data-hotkey="g i" data-selected-links="repo_issues /ageitgey/node-unfluff/issues"> <span class="octicon octicon-issue-opened"></span> <span class="full-word">Issues</span> <span class='counter'>1</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> <li class="tooltipped tooltipped-w" aria-label="Pull Requests"> <a href="/ageitgey/node-unfluff/pulls" aria-label="Pull Requests" class="js-selected-navigation-item sunken-menu-item js-disable-pjax" data-hotkey="g p" 
data-selected-links="repo_pulls /ageitgey/node-unfluff/pulls"> <span class="octicon octicon-git-pull-request"></span> <span class="full-word">Pull Requests</span> <span class='counter'>0</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> </ul> <div class="sunken-menu-separator"></div> <ul class="sunken-menu-group"> <li class="tooltipped tooltipped-w" aria-label="Pulse"> <a href="/ageitgey/node-unfluff/pulse" aria-label="Pulse" class="js-selected-navigation-item sunken-menu-item" data-pjax="true" data-selected-links="pulse /ageitgey/node-unfluff/pulse"> <span class="octicon octicon-pulse"></span> <span class="full-word">Pulse</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> <li class="tooltipped tooltipped-w" aria-label="Graphs"> <a href="/ageitgey/node-unfluff/graphs" aria-label="Graphs" class="js-selected-navigation-item sunken-menu-item" data-pjax="true" data-selected-links="repo_graphs repo_contributors /ageitgey/node-unfluff/graphs"> <span class="octicon octicon-graph"></span> <span class="full-word">Graphs</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> <li class="tooltipped tooltipped-w" aria-label="Network"> <a href="/ageitgey/node-unfluff/network" aria-label="Network" class="js-selected-navigation-item sunken-menu-item js-disable-pjax" data-selected-links="repo_network /ageitgey/node-unfluff/network"> <span class="octicon octicon-repo-forked"></span> <span class="full-word">Network</span> <img alt="" class="mini-loader" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </a> </li> </ul> </div> </div> <div class="only-with-full-nav"> <div class="clone-url open" data-protocol-type="http" 
data-url="/users/set_protocol?protocol_selector=http&amp;protocol_type=clone"> <h3><strong>HTTPS</strong> clone URL</h3> <div class="clone-url-box"> <input type="text" class="clone js-url-field" value="https://github.com/ageitgey/node-unfluff.git" readonly="readonly"> <span class="url-box-clippy"> <button aria-label="Copy to clipboard" class="js-zeroclipboard minibutton zeroclipboard-button" data-clipboard-text="https://github.com/ageitgey/node-unfluff.git" data-copied-hint="Copied!" type="button"><span class="octicon octicon-clippy"></span></button> </span> </div> </div> <div class="clone-url " data-protocol-type="subversion" data-url="/users/set_protocol?protocol_selector=subversion&amp;protocol_type=clone"> <h3><strong>Subversion</strong> checkout URL</h3> <div class="clone-url-box"> <input type="text" class="clone js-url-field" value="https://github.com/ageitgey/node-unfluff" readonly="readonly"> <span class="url-box-clippy"> <button aria-label="Copy to clipboard" class="js-zeroclipboard minibutton zeroclipboard-button" data-clipboard-text="https://github.com/ageitgey/node-unfluff" data-copied-hint="Copied!" type="button"><span class="octicon octicon-clippy"></span></button> </span> </div> </div> <p class="clone-options">You can clone with <a href="#" class="js-clone-selector" data-protocol="http">HTTPS</a> or <a href="#" class="js-clone-selector" data-protocol="subversion">Subversion</a>. 
<a href="https://help.github.com/articles/which-remote-url-should-i-use" class="help tooltipped tooltipped-n" aria-label="Get help on which URL is right for you."> <span class="octicon octicon-question"></span> </a> </p> <a href="/ageitgey/node-unfluff/archive/master.zip" class="minibutton sidebar-button" aria-label="Download ageitgey/node-unfluff as a zip file" title="Download ageitgey/node-unfluff as a zip file" rel="nofollow"> <span class="octicon octicon-cloud-download"></span> Download ZIP </a> </div> </div><!-- /.repository-sidebar --> <div id="js-repo-pjax-container" class="repository-content context-loader-container" data-pjax-container> <span id="js-show-full-navigation"></span> <div class="repository-meta js-details-container "> <div class="repository-description js-details-show"> <p>Automatically extract body content (and other cool stuff) from an html document</p> </div> </div> <div class="capped-box overall-summary "> <div class="stats-switcher-viewport js-stats-switcher-viewport"> <div class="stats-switcher-wrapper"> <ul class="numbers-summary"> <li class="commits"> <a data-pjax href="/ageitgey/node-unfluff/commits/master"> <span class="num"> <span class="octicon octicon-history"></span> 26 </span> commits </a> </li> <li> <a data-pjax href="/ageitgey/node-unfluff/branches"> <span class="num"> <span class="octicon octicon-git-branch"></span> 2 </span> branches </a> </li> <li> <a data-pjax href="/ageitgey/node-unfluff/releases"> <span class="num"> <span class="octicon octicon-tag"></span> 4 </span> releases </a> </li> <li> <a href="/ageitgey/node-unfluff/graphs/contributors"> <span class="num"> <span class="octicon octicon-organization"></span> 1 </span> contributor </a> </li> </ul> <div class="repository-lang-stats"> <ol class="repository-lang-stats-numbers"> <li> <a href="/ageitgey/node-unfluff/search?l=coffeescript"> <span class="color-block language-color" style="background-color:#244776;"></span> <span class="lang">CoffeeScript</span> <span 
class="percent">99.8%</span> </a> </li> <li> <a href="/ageitgey/node-unfluff/search?l=javascript"> <span class="color-block language-color" style="background-color:#f1e05a;"></span> <span class="lang">JavaScript</span> <span class="percent">0.2%</span> </a> </li> </ol> </div> </div> </div> </div> <div class="tooltipped tooltipped-s" aria-label="Show language statistics"> <a href="#" class="repository-lang-stats-graph js-toggle-lang-stats" style="background-color:#f1e05a"> <span class="language-color" style="width:99.8%; background-color:#244776;" itemprop="keywords">CoffeeScript</span><span class="language-color" style="width:0.2%; background-color:#f1e05a;" itemprop="keywords">JavaScript</span> </a> </div> <div class="file-navigation in-mid-page"> <a href="/ageitgey/node-unfluff/find/master" class="js-show-file-finder minibutton empty-icon tooltipped tooltipped-s right" data-pjax data-hotkey="t" aria-label="Quickly jump between files"> <span class="octicon octicon-list-unordered"></span> </a> <a href="/ageitgey/node-unfluff/compare" aria-label="Compare, review, create a pull request" class="minibutton compact primary tooltipped tooltipped-s" aria-label="Compare &amp; review" data-pjax> <span class="octicon octicon-git-compare"></span> </a> <div class="select-menu js-menu-container js-select-menu" > <span class="minibutton select-menu-button js-menu-target css-truncate" data-hotkey="w" data-master-branch="master" data-ref="master" title="master" role="button" aria-label="Switch branches or tags" tabindex="0" aria-haspopup="true"> <span class="octicon octicon-git-branch"></span> <i>branch:</i> <span class="js-select-button css-truncate-target">master</span> </span> <div class="select-menu-modal-holder js-menu-content js-navigation-container" data-pjax aria-hidden="true"> <div class="select-menu-modal"> <div class="select-menu-header"> <span class="select-menu-title">Switch branches/tags</span> <span class="octicon octicon-x js-menu-close" role="button" 
aria-label="Close"></span> </div> <!-- /.select-menu-header --> <div class="select-menu-filters"> <div class="select-menu-text-filter"> <input type="text" aria-label="Filter branches/tags" id="context-commitish-filter-field" class="js-filterable-field js-navigation-enable" placeholder="Filter branches/tags"> </div> <div class="select-menu-tabs"> <ul> <li class="select-menu-tab"> <a href="#" data-tab-filter="branches" class="js-select-menu-tab">Branches</a> </li> <li class="select-menu-tab"> <a href="#" data-tab-filter="tags" class="js-select-menu-tab">Tags</a> </li> </ul> </div><!-- /.select-menu-tabs --> </div><!-- /.select-menu-filters --> <div class="select-menu-list select-menu-tab-bucket js-select-menu-tab-bucket" data-tab-filter="branches"> <div data-filterable-for="context-commitish-filter-field" data-filterable-type="substring"> <div class="select-menu-item js-navigation-item "> <span class="select-menu-item-icon octicon octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/ag-fix-sec-pages" data-name="ag-fix-sec-pages" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="ag-fix-sec-pages">ag-fix-sec-pages</a> </div> <!-- /.select-menu-item --> <div class="select-menu-item js-navigation-item selected"> <span class="select-menu-item-icon octicon octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/master" data-name="master" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="master">master</a> </div> <!-- /.select-menu-item --> </div> <div class="select-menu-no-results">Nothing to show</div> </div> <!-- /.select-menu-list --> <div class="select-menu-list select-menu-tab-bucket js-select-menu-tab-bucket" data-tab-filter="tags"> <div data-filterable-for="context-commitish-filter-field" data-filterable-type="substring"> <div class="select-menu-item js-navigation-item "> <span class="select-menu-item-icon octicon 
octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/v0.3.0" data-name="v0.3.0" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="v0.3.0">v0.3.0</a> </div> <!-- /.select-menu-item --> <div class="select-menu-item js-navigation-item "> <span class="select-menu-item-icon octicon octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/v0.2.0" data-name="v0.2.0" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="v0.2.0">v0.2.0</a> </div> <!-- /.select-menu-item --> <div class="select-menu-item js-navigation-item "> <span class="select-menu-item-icon octicon octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/v0.1.0" data-name="v0.1.0" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="v0.1.0">v0.1.0</a> </div> <!-- /.select-menu-item --> <div class="select-menu-item js-navigation-item "> <span class="select-menu-item-icon octicon octicon-check"></span> <a href="/ageitgey/node-unfluff/tree/v0.0.2" data-name="v0.0.2" data-skip-pjax="true" rel="nofollow" class="js-navigation-open select-menu-item-text css-truncate-target" title="v0.0.2">v0.0.2</a> </div> <!-- /.select-menu-item --> </div> <div class="select-menu-no-results">Nothing to show</div> </div> <!-- /.select-menu-list --> </div> <!-- /.select-menu-modal --> </div> <!-- /.select-menu-modal-holder --> </div> <!-- /.select-menu --> <div class="breadcrumb"><span class='repo-root js-repo-root'><span itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a href="/ageitgey/node-unfluff" data-branch="master" data-direction="back" data-pjax="true" itemscope="url"><span itemprop="title">node-unfluff</span></a></span></span><span class="separator"> / </span><form action="/login?return_to=%2Fageitgey%2Fnode-unfluff" aria-label="Sign in to make or propose changes" class="js-new-blob-form tooltipped tooltipped-e 
new-file-link" method="post"><span aria-label="Sign in to make or propose changes" class="js-new-blob-submit octicon octicon-plus" data-test-id="create-new-git-file" role="button"></span></form></div> </div> <div class="commit commit-tease js-details-container" > <p class="commit-title "> <a href="/ageitgey/node-unfluff/commit/8138584dcc4e7119df2c96f8ab60573aed163d82" class="message" data-pjax="true" title="Update CHANGELOG.md">Update CHANGELOG.md</a> </p> <div class="commit-meta"> <button aria-label="Copy SHA" class="js-zeroclipboard zeroclipboard-link" data-clipboard-text="8138584dcc4e7119df2c96f8ab60573aed163d82" data-copied-hint="Copied!" type="button"><span class="octicon octicon-clippy"></span></button> <a href="/ageitgey/node-unfluff/commit/8138584dcc4e7119df2c96f8ab60573aed163d82" class="sha-block" data-pjax>latest commit <span class="sha">8138584dcc</span></a> <div class="authorship"> <img alt="" class="gravatar js-avatar" data-user="896692" height="20" src="https://avatars2.githubusercontent.com/u/896692?s=140" width="20" /> <span class="author-name"><a href="/ageitgey" rel="author">ageitgey</a></span> authored <time class="updated" datetime="2014-07-13T22:17:32-07:00" is="relative-time">July 13, 2014</time> </div> </div> </div> <div class="file-wrap"> <table class="files" data-pjax> <tbody class="" data-url="/ageitgey/node-unfluff/file-list/master" data-deferred-content-error="Failed to load latest commit information."> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/bin" class="js-directory-link" id="c1111bd512b29e821b120b86446026b8-fbfcca3c6847948f65bd44d5d188f5a283aaa181" title="bin">bin</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a 
href="/ageitgey/node-unfluff/commit/818fd49ad6f268c96ca0eae52eccb758f30f406e" class="message" data-pjax="true" title="initial checkin of 0.0.1">initial checkin of 0.0.1</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-05T00:46:39Z" is="time-ago">July 04, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/data" class="js-directory-link" id="8d777f385d3dfec8815d20f7496026dc-66f88b6f0dfb3fe89db43e5f0fdfadb98187aeee" title="data">data</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/818fd49ad6f268c96ca0eae52eccb758f30f406e" class="message" data-pjax="true" title="initial checkin of 0.0.1">initial checkin of 0.0.1</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-05T00:46:39Z" is="time-ago">July 05, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/fixtures" class="js-directory-link" id="9403e5114acb6bb59791a97291be54b5-02e959156a838220e40ddb916d3bfd6fc3f9f7e4" title="fixtures">fixtures</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/bc49f9a04d2c330d9deda514c10cdc88410400f6" class="message" data-pjax="true" title="Fix issue parsing SEC webpages due to junky line breaks and bad u tags">Fix issue parsing SEC 
webpages due to junky line breaks and bad u tags</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-13T00:32:31Z" is="time-ago">July 12, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/lib" class="js-directory-link" id="e8acc63b1e238f3255c900eed37254b8-dfb865def3238059e3e1a6ea9efd10636f34d768" title="lib">lib</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/5e4ecc2fc14e07e20a9ba038d4a1ccbc421fa8bd" class="message" data-pjax="true" title="Update .js files with lodash">Update .js files with lodash</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-13T00:42:44Z" is="time-ago">July 12, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/src" class="js-directory-link" id="25d902c24283ab8cfbac54dfa101ad31-dcc0256e929cc02b62bfbf510a97d3b9adfc5dae" title="src">src</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/97d2a36a51f84bd08478a29f6c99e3405e13e420" class="message" data-pjax="true" title="Switch from underscore to lodash">Switch from underscore to lodash</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-13T00:36:39Z" is="time-ago">July 12, 2014</time></span> 
</td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-directory"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/tree/master/test" class="js-directory-link" id="098f6bcd4621d373cade4e832627b4f6-ecc84ab6ab8bee3f8b1355a6e376114da292ddd1" title="test">test</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/3e48753c77364eea7076f404fd92b8c9c802e45d" class="message" data-pjax="true" title="Update tests to swtich from underscore to lodash">Update tests to swtich from underscore to lodash</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-13T00:39:27Z" is="time-ago">July 12, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/.gitignore" class="js-directory-link" id="a084b794bc0759e7a6b77810e01874f2-922d9c70a456b8470558f7c05aa6e18f2e42834a" title=".gitignore">.gitignore</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/818fd49ad6f268c96ca0eae52eccb758f30f406e" class="message" data-pjax="true" title="initial checkin of 0.0.1">initial checkin of 0.0.1</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-05T00:46:39Z" is="time-ago">July 05, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" 
src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/.travis.yml" class="js-directory-link" id="354f30a63fb0907d4ad57269548329e3-05d299e676449634733c2040c619a704e735e49a" title=".travis.yml">.travis.yml</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/fba5f505858f5ba0bae10272e99737d9cc2e3413" class="message" data-pjax="true" title="adding travisci config">adding travisci config</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-05T01:28:00Z" is="time-ago">July 04, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/CHANGELOG.md" class="js-directory-link" id="4ac32a78649ca5bdd8e0ba38b7006a1e-8b336b56932edaaca33a616a80202281a277e2aa" title="CHANGELOG.md">CHANGELOG.md</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/8138584dcc4e7119df2c96f8ab60573aed163d82" class="message" data-pjax="true" title="Update CHANGELOG.md">Update CHANGELOG.md</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-14T05:17:32Z" is="time-ago">July 13, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a 
href="/ageitgey/node-unfluff/blob/master/LICENSE" class="js-directory-link" id="9879d6db96fd29134fc802214163b95a-ad410e11302107da9aa47ce3d46bd5ad011c4c43" title="LICENSE">LICENSE</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/95897afcf6be41a8d97074b128dbcd88e3ef8a0a" class="message" data-pjax="true" title="Initial commit">Initial commit</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-01T00:09:17Z" is="time-ago">June 30, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/Makefile" class="js-directory-link" id="b67911656ef5d18c4ae36cb6741b7965-557d91debef0b1d57044d073506519eb79148cce" title="Makefile">Makefile</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/818fd49ad6f268c96ca0eae52eccb758f30f406e" class="message" data-pjax="true" title="initial checkin of 0.0.1">initial checkin of 0.0.1</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-05T00:46:39Z" is="time-ago">July 05, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/README.md" class="js-directory-link" id="04c6e90faac2675aa89e2176d2eec7d8-53c88c6daa15e80e5f496eaa839a09723cee8fb5" title="README.md">README.md</a></span> </td> <td class="message"> <span 
class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/e5bc56f495370afa52948ca18a5848d777cf63a8" class="message" data-pjax="true" title="Link to fetchtext">Link to fetchtext</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-07T17:15:13Z" is="time-ago">July 07, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/package.json" class="js-directory-link" id="b9cfc7f2cdf78a7f4b91a753d10865a2-41e8d720786da1b1125e3e68e75b37dc5e7d7261" title="package.json">package.json</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/c6a57723711b8738da90cfe882e38d8f9dda7af0" class="message" data-pjax="true" title="Version 0.3.0">Version 0.3.0</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-13T00:47:12Z" is="time-ago">July 12, 2014</time></span> </td> </tr> <tr> <td class="icon"> <span class="octicon octicon-file-text"></span> <img alt="" class="spinner" height="16" src="https://assets-cdn.github.com/images/spinners/octocat-spinner-32.gif" width="16" /> </td> <td class="content"> <span class="css-truncate css-truncate-target"><a href="/ageitgey/node-unfluff/blob/master/test-setup.coffee" class="js-directory-link" id="bf8e81c77bae182379cf8f1ee6d2f83d-7f642b6a111f0f4f540045838b50d78ea9362813" title="test-setup.coffee">test-setup.coffee</a></span> </td> <td class="message"> <span class="css-truncate css-truncate-target "> <a href="/ageitgey/node-unfluff/commit/5e0a3a2a4d5e572a46a070a800fa63fac1461af2" class="message" data-pjax="true" title="Add support for extracting embedded videos from 
web pages">Add support for extracting embedded videos from web pages</a> </span> </td> <td class="age"> <span class="css-truncate css-truncate-target"><time datetime="2014-07-06T01:22:55Z" is="time-ago">July 05, 2014</time></span> </td> </tr> </tbody> </table> </div> <div id="readme" class="clearfix announce instapaper_body md"> <span class="name"> <span class="octicon octicon-book"></span> README.md </span> <article class="markdown-body entry-content" itemprop="mainContentOfPage"><h1> <a name="user-content-unfluff" class="anchor" href="#unfluff" aria-hidden="true"><span class="octicon octicon-link"></span></a>unfluff</h1> <p>An automatic web page content extractor for Node.js!</p> <p><a href="https://travis-ci.org/ageitgey/node-unfluff"><img src="https://camo.githubusercontent.com/1db440b56d8feeb36473c63824dff64d3a4da92a/68747470733a2f2f7472617669732d63692e6f72672f61676569746765792f6e6f64652d756e666c7566662e7376673f6272616e63683d6d6173746572" alt="Build Status" data-canonical-src="https://travis-ci.org/ageitgey/node-unfluff.svg?branch=master" style="max-width:100%;"></a></p> <p>Automatically grab the main text out of a webpage like this:</p> <pre><code>extractor = require('unfluff'); data = extractor(my_html_data); console.log(data.text); </code></pre> <p>In other words, it turns pretty webpages into boring plain text/json data:</p> <p><a href="https://cloud.githubusercontent.com/assets/896692/3478577/b82f39cc-033d-11e4-9e68-226c9a7bc1c0.jpg" target="_blank"><img src="https://cloud.githubusercontent.com/assets/896692/3478577/b82f39cc-033d-11e4-9e68-226c9a7bc1c0.jpg" alt="" style="max-width:100%;"></a></p> <p>This might be useful for:</p> <ul class="task-list"> <li>Writing your own Instapaper clone</li> <li>Easily building ML data sets from web pages</li> <li>Reading your favorite articles from the console?</li> </ul><p>Please don't use this for:</p> <ul class="task-list"> <li>Stealing other people's web pages</li> <li>Making crappy spam sites with stolen content 
from other sites</li> <li>Being a jerk</li> </ul><h2> <a name="user-content-credits--thanks" class="anchor" href="#credits--thanks" aria-hidden="true"><span class="octicon octicon-link"></span></a>Credits / Thanks</h2> <p>This library is largely based on <a href="https://github.com/grangier/python-goose">python-goose</a> by <a href="https://github.com/grangier">Xavier Grangier</a>, which is in turn based on <a href="https://github.com/GravityLabs/goose">goose</a> by <a href="https://github.com/GravityLabs">Gravity Labs</a>. However, it's not an exact port, so it may behave differently on some pages, and the feature set is a little different. If you are looking for a Python or Scala/Java/JVM solution, check out those libraries!</p> <h2> <a name="user-content-install" class="anchor" href="#install" aria-hidden="true"><span class="octicon octicon-link"></span></a>Install</h2> <p>To install the command-line <code>unfluff</code> utility:</p> <pre><code>npm install -g unfluff </code></pre> <p>To install the <code>unfluff</code> module for use in your Node.js project:</p> <pre><code>npm install --save unfluff </code></pre> <h2> <a name="user-content-usage" class="anchor" href="#usage" aria-hidden="true"><span class="octicon octicon-link"></span></a>Usage</h2> <p>You can use <code>unfluff</code> from node or right on the command line!</p> <h3> <a name="user-content-extracted-data-elements" class="anchor" href="#extracted-data-elements" aria-hidden="true"><span class="octicon octicon-link"></span></a>Extracted data elements</h3> <p>This is what <code>unfluff</code> will try to grab from a web page:</p> <ul class="task-list"> <li> <code>title</code> - The document's title (from the &lt;title&gt; tag)</li> <li> <code>text</code> - The main text of the document with all the junk thrown away</li> <li> <code>image</code> - The main image for the document (what's used by Facebook, etc.)</li> <li> <code>videos</code> - An array of videos that were embedded in the article. 
Each video has src, width and height.</li> <li> <code>tags</code> - Any tags or keywords that could be found by checking <code>rel</code> attributes or by looking at href urls.</li> <li> <code>canonicalLink</code> - The <a href="https://support.google.com/webmasters/answer/139066?hl=en">canonical url</a> of the document, if given.</li> <li> <code>lang</code> - The language of the document, either detected or supplied by you.</li> <li> <code>description</code> - The description of the document, from &lt;meta&gt; tags</li> <li> <code>favicon</code> - The url of the document's <a href="http://en.wikipedia.org/wiki/Favicon">favicon</a>.</li> </ul><p>This is returned as a simple json object.</p> <h3> <a name="user-content-command-line-interface" class="anchor" href="#command-line-interface" aria-hidden="true"><span class="octicon octicon-link"></span></a>Command line interface</h3> <p>You can pass a webpage to unfluff and it will try to parse out the interesting bits.</p> <p>You can either pass in a file name:</p> <pre><code>unfluff my_file.html </code></pre> <p>Or you can pipe it in:</p> <pre><code>curl -s "http://somesite.com/page" | unfluff </code></pre> <p>You can easily chain this together with other unix commands to do cool stuff. 
For example, you can download a web page, parse it and then use <a href="http://stedolan.github.io/jq/">jq</a> to print just the body text.</p> <pre><code>curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text </code></pre> <p>And here's how to find the top 10 most common words in an article:</p> <pre><code>curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10 </code></pre> <h3> <a name="user-content-module-interface" class="anchor" href="#module-interface" aria-hidden="true"><span class="octicon octicon-link"></span></a>Module Interface</h3> <h4> <a name="user-content-extractorhtml-language" class="anchor" href="#extractorhtml-language" aria-hidden="true"><span class="octicon octicon-link"></span></a><code>extractor(html, language)</code> </h4> <p>html: The html you want to parse</p> <p>language (optional): The document's two-letter language code. 
This will be auto-detected as best as possible, but there might be cases where you want to override it.</p> <p>The extraction algorithm depends heavily on the language, so it probably won't work if you have the language set incorrectly.</p> <div class="highlight highlight-javascript"><pre><span class="nx">extractor</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'unfluff'</span><span class="p">);</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">extractor</span><span class="p">(</span><span class="nx">my_html_data</span><span class="p">);</span> </pre></div> <p>Or supply the language code yourself:</p> <div class="highlight highlight-javascript"><pre><span class="nx">extractor</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'unfluff'</span><span class="p">);</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">extractor</span><span class="p">(</span><span class="nx">my_html_data</span><span class="p">,</span> <span class="s1">'en'</span><span class="p">);</span> </pre></div> <p><code>data</code> will then be a json object that looks like this:</p> <div class="highlight highlight-json"><pre><span class="p">{</span> <span class="nt">"title"</span><span class="p">:</span> <span class="s2">"Shovel Knight review: rewrite history"</span><span class="p">,</span> <span class="nt">"text"</span><span class="p">:</span> <span class="s2">"Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. 
snip ..]"</span><span class="p">,</span> <span class="nt">"image"</span><span class="p">:</span> <span class="s2">"http://cdn2.vox-cdn.com/uploads/chorus_image/image/34834129/jellyfish_hero.0_cinema_1280.0.png"</span><span class="p">,</span> <span class="nt">"tags"</span><span class="p">:</span> <span class="p">[],</span> <span class="nt">"videos"</span><span class="p">:</span> <span class="p">[],</span> <span class="nt">"canonicalLink"</span><span class="p">:</span> <span class="s2">"http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u"</span><span class="p">,</span> <span class="nt">"lang"</span><span class="p">:</span> <span class="s2">"en"</span><span class="p">,</span> <span class="nt">"description"</span><span class="p">:</span> <span class="s2">"Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it."</span><span class="p">,</span> <span class="nt">"favicon"</span><span class="p">:</span> <span class="s2">"http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico"</span> <span class="p">}</span> </pre></div> <h3> <a name="user-content-demo" class="anchor" href="#demo" aria-hidden="true"><span class="octicon octicon-link"></span></a>Demo</h3> <p>The easiest way to try out <code>unfluff</code> is to just install it:</p> <pre><code>$ npm install -g unfluff $ curl -s "http://www.cnn.com/2014/07/07/world/americas/mexico-earthquake/index.html" | unfluff </code></pre> <p>But if you can't be bothered, you can check out <a href="http://fetchtext.herokuapp.com/">fetch text</a>. It's a site by <a href="https://twitter.com/andyjiang">Andy Jiang</a> that uses <code>unfluff</code>. You send an email with a url and it emails back with the cleaned content of that url. It should give you a good idea of how <code>unfluff</code> handles different urls.</p> <h3> <a name="