<html>
<head>
<meta charset='utf-8' />
<title>easy_web_crawler 1.0.5 | Documentation</title>
<meta name='viewport' content='width=device-width,initial-scale=1'>
<link href='assets/bass.css' type='text/css' rel='stylesheet' />
<link href='assets/style.css' type='text/css' rel='stylesheet' />
<link href='assets/github.css' type='text/css' rel='stylesheet' />
<link href='assets/split.css' type='text/css' rel='stylesheet' />
<meta name='description' content='Web crawler wrapper around the puppeteer module to simplify crawling of AJAX/JavaScript-enabled pages.'>
</head>
<body class='documentation m0'>
<div class='flex'>
<div id='split-left' class='overflow-auto fs0 height-viewport-100'>
<div class='py1 px2'>
<h3 class='mb0 no-anchor'>easy_web_crawler</h3>
<div class='mb1'><code>1.0.5</code></div>
<input
placeholder='Filter'
id='filter-input'
class='col12 block input'
type='text' />
<div id='toc'>
<ul class='list-reset h5 py1-ul'>
<li><a
href='#scaper'
class=" toggle-sibling">
Scaper
<span class='icon'>▸</span>
</a>
<div class='toggle-target display-none'>
<ul class='list-reset py1-ul pl1'>
<li class='h5'><span>Instance members</span></li>
<li><a
href='#scaperstartwithurls'
class='regular pre-open'>
#startWithURLs
</a></li>
<li><a
href='#scaperallowifmatches'
class='regular pre-open'>
#allowIfMatches
</a></li>
<li><a
href='#scapersaveprogressinfile'
class='regular pre-open'>
#saveProgressInFile
</a></li>
<li><a
href='#scaperenableautocrawler'
class='regular pre-open'>
#enableAutoCrawler
</a></li>
<li><a
href='#scaperwaitbetweenpageload'
class='regular pre-open'>
#waitBetweenPageLoad
</a></li>
<li><a
href='#scapercallbackonfinish'
class='regular pre-open'>
#callbackOnFinish
</a></li>
<li><a
href='#scapercallbackonpageload'
class='regular pre-open'>
#callbackOnPageLoad
</a></li>
<li><a
href='#scaperstart'
class='regular pre-open'>
#start
</a></li>
</ul>
</div>
</li>
<li><a
href='#page'
class=" toggle-sibling">
Page
<span class='icon'>▸</span>
</a>
<div class='toggle-target display-none'>
<ul class='list-reset py1-ul pl1'>
<li class='h5'><span>Instance members</span></li>
<li><a
href='#pagedownload_image'
class='regular pre-open'>
#download_image
</a></li>
<li><a
href='#pagesaveresult'
class='regular pre-open'>
#saveResult
</a></li>
<li><a
href='#pagewrite_text_to_file'
class='regular pre-open'>
#write_text_to_file
</a></li>
<li><a
href='#pageadd_url_to_queue'
class='regular pre-open'>
#add_url_to_queue
</a></li>
</ul>
</div>
</li>
</ul>
</div>
<div class='mt1 h6 quiet'>
<a href='http://documentation.js.org/reading-documentation.html'>Need help reading this?</a>
</div>
</div>
</div>
<div id='split-right' class='relative overflow-auto height-viewport-100'>
<section class='p2 mb2 clearfix bg-white minishadow'>
<div class='clearfix'>
<h3 class='fl m0' id='scaper'>
Scaper
</h3>
</div>
<p>Main Scraper class</p>
<div class='pre p1 fill-light mt0'>new Scaper()</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// npm install easy_web_crawler</span>
<span class="hljs-keyword">const</span> Scaper = <span class="hljs-built_in">require</span>(<span class="hljs-string">'easy_web_crawler'</span>)
<span class="hljs-keyword">var</span> scraper = <span class="hljs-keyword">new</span> Scaper();</pre>
<div class='py1 quiet mt1 prose-big'>Instance Members</div>
<div class="clearfix">
<div class='border-bottom' id='scaperstartwithurls'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>startWithURLs(listOfURLs)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Mandatory.<br>
Takes a URL, or a list of URLs, used as the starting point of the crawl.</p>
<div class='pre p1 fill-light mt0'>startWithURLs(listOfURLs: (<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a> | <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Array">Array</a>&lt;<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>&gt;))</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>listOfURLs</span> <code class='quiet'>((<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a> | <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Array">Array</a>&lt;<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>&gt;))</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// add the urls as the starting point</span>
scraper.startWithURLs([<span class="hljs-string">'www.google.com'</span>,<span class="hljs-string">'www.bing.com'</span>])
scraper.startWithURLs(<span class="hljs-string">'www.google.com'</span>)</pre>
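<p>The signature accepts either a single string or an array of strings. The following is a minimal sketch of how such an argument could be normalized into a clean list before it is handed to <code>startWithURLs</code>. The helper <code>toURLList</code> is a hypothetical name used for illustration only; it is not part of easy_web_crawler.</p>

```javascript
// Hypothetical helper (not part of easy_web_crawler): accept a string or an
// array of strings and always return a de-duplicated array of trimmed URLs.
function toURLList(listOfURLs) {
  const list = Array.isArray(listOfURLs) ? listOfURLs : [listOfURLs];
  const cleaned = list
    .filter((url) => typeof url === 'string')
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  return [...new Set(cleaned)];
}

const seeds = toURLList(['www.google.com', ' www.bing.com ', 'www.google.com']);
console.log(seeds);
// The resulting array could then be passed on: scraper.startWithURLs(seeds)
```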
</section>
</div>
</div>
<div class='border-bottom' id='scaperallowifmatches'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>allowIfMatches(nonAsyncFunction)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Takes a non-async callback function as its argument. A URL is added to the processing queue only if the function returns true for it.<br>
Optional.
By default, every URL added to the processing queue is accepted.<br></p>
<div class='pre p1 fill-light mt0'>allowIfMatches(nonAsyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>nonAsyncFunction</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// accept only URLs that contain www.google.com</span>
scraper.allowIfMatches(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">url</span>) </span>{
<span class="hljs-keyword">return</span> url.indexOf(<span class="hljs-string">'www.google.com'</span>)><span class="hljs-number">-1</span>
})</pre>
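<p>A substring check such as the one above also accepts URLs that merely mention the host somewhere in the path or query string. The sketch below compares the parsed hostname instead, using Node's built-in <code>URL</code> class. The name <code>isGoogleURL</code> is hypothetical, and the sketch assumes the crawler passes absolute URLs (including the scheme) to the callback, which this documentation does not state.</p>

```javascript
// Hypothetical predicate for allowIfMatches: synchronous, returns a boolean.
// Assumes the crawler hands over absolute URLs; anything unparsable is rejected.
function isGoogleURL(url) {
  try {
    return new URL(url).hostname === 'www.google.com';
  } catch (err) {
    return false; // not an absolute URL
  }
}

console.log(isGoogleURL('https://www.google.com/search?q=crawler')); // true
console.log(isGoogleURL('https://example.com/?next=www.google.com')); // false
// Could be registered as: scraper.allowIfMatches(isGoogleURL)
```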
</section>
</div>
</div>
<div class='border-bottom' id='scapersaveprogressinfile'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>saveProgressInFile(filePath)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Optional setting.<br>
Saves the crawl progress to the given file, so the scraper can be stopped and later resumed from its previous state.<br>
The file is a SQLite database file; you can inspect or modify its content using any SQLite client.<br>
If no file is specified, the state is stored in memory.</p>
<div class='pre p1 fill-light mt0'>saveProgressInFile(filePath: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>filePath</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// state stored in state.db file</span>
scraper.saveProgressInFile(<span class="hljs-string">"./state.db"</span>)</pre>
</section>
</div>
</div>
<div class='border-bottom' id='scaperenableautocrawler'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>enableAutoCrawler(flag, enableAutoCrawler)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Allows the scraper to automatically collect all the links from each page and add them to the processing queue.<br>
Note that a URL is filtered out if the allowIfMatches function returns false for it.</p>
<div class='pre p1 fill-light mt0'>enableAutoCrawler(flag: any, enableAutoCrawler: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Boolean">boolean</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>flag</span> <code class='quiet'>(any)</code>
</div>
</div>
<div class='space-bottom0'>
<div>
<span class='code bold'>enableAutoCrawler</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Boolean">boolean</a>)</code>
true to enable
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.enableAutoCrawler(<span class="hljs-literal">true</span>)</pre>
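<p>The interaction between auto-crawling and <code>allowIfMatches</code> can be pictured as a filtering step between link discovery and the queue. This is a sketch of that idea only, not the library's implementation. The function name <code>selectLinksForQueue</code> and the duplicate check via a <code>Set</code> are assumptions made for illustration.</p>

```javascript
// Sketch of the filtering step described above, NOT the library's code.
// links: hrefs found on a page; allow: the predicate given to allowIfMatches;
// seen: a Set of URLs that are already queued or processed (an assumption).
function selectLinksForQueue(links, allow, seen) {
  const accepted = [];
  for (const link of links) {
    if (seen.has(link)) continue; // skip duplicates
    if (!allow(link)) continue;   // predicate returned false: filtered out
    seen.add(link);
    accepted.push(link);
  }
  return accepted;
}

const seen = new Set(['https://www.google.com/']);
const found = [
  'https://www.google.com/',
  'https://www.google.com/maps',
  'https://www.bing.com/',
  'https://www.google.com/maps',
];
const queued = selectLinksForQueue(found, (url) => url.includes('www.google.com'), seen);
console.log(queued); // [ 'https://www.google.com/maps' ]
```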
</section>
</div>
</div>
<div class='border-bottom' id='scaperwaitbetweenpageload'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>waitBetweenPageLoad(delayInMilliSeconds)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Sets the time delay between consecutive page loads, in milliseconds.</p>
<div class='pre p1 fill-light mt0'>waitBetweenPageLoad(delayInMilliSeconds: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Number">number</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>delayInMilliSeconds</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Number">number</a>
= <code>0</code>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// wait 90 milliseconds between page loads</span>
scraper.waitBetweenPageLoad(<span class="hljs-number">90</span>)</pre>
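<p>For readers unfamiliar with throttled crawling, the sketch below shows one common way a fixed delay between page loads can be implemented with promises. It is an illustration under assumptions, not the library's code; <code>sleep</code> and <code>visitAll</code> are hypothetical names.</p>

```javascript
// Sketch only: how a fixed delay between page loads can be implemented.
// sleep and visitAll are hypothetical names, not easy_web_crawler functions.
function sleep(delayInMilliSeconds) {
  return new Promise((resolve) => setTimeout(resolve, delayInMilliSeconds));
}

async function visitAll(urls, visit, delayInMilliSeconds = 0) {
  const visited = [];
  for (const url of urls) {
    await visit(url);                 // load and process one page
    visited.push(url);
    await sleep(delayInMilliSeconds); // pause before the next page load
  }
  return visited;
}

// Visits two placeholder URLs with a 90 ms pause after each one.
const done = visitAll(['page-1', 'page-2'], async (url) => console.log('visiting', url), 90);
```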
</section>
</div>
</div>
<div class='border-bottom' id='scapercallbackonfinish'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>callbackOnFinish(asyncFunction)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Final callback, invoked once scraping is completed.</p>
<div class='pre p1 fill-light mt0'>callbackOnFinish(asyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>asyncFunction</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.callbackOnFinish(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">result</span>)</span>{
<span class="hljs-built_in">console</span>.log(result)
})</pre>
</section>
</div>
</div>
<div class='border-bottom' id='scapercallbackonpageload'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>callbackOnPageLoad(asyncFunction)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>This is the main function. Define your scraping logic in the callback passed to it.<br>
The callback is called for each page in the processing queue.<br>
It is called with the Puppeteer page object as input.<br>
The page object has additional methods to support scraping.</p>
<div class='pre p1 fill-light mt0'>callbackOnPageLoad(asyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>asyncFunction</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code>
an async function with a single input argument, page.
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
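<pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// called once for every page in the processing queue</span>
scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{ page.saveResult(<span class="hljs-keyword">await</span> page.title()) })</pre>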
</section>
</div>
</div>
<div class='border-bottom' id='scaperstart'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>start()</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Starts the scraping process.
The callbackOnFinish function is called once scraping is completed.</p>
<div class='pre p1 fill-light mt0'>start()</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.start()</pre>
</section>
</div>
</div>
</div>
</section>
<section class='p2 mb2 clearfix bg-white minishadow'>
<div class='clearfix'>
<h3 class='fl m0' id='page'>
Page
</h3>
</div>
<p>Puppeteer page class,
enhanced with the supporting functions detailed below.</p>
<div class='pre p1 fill-light mt0'>new Page()</div>
<div class='py1 quiet mt1 prose-big'>Instance Members</div>
<div class="clearfix">
<div class='border-bottom' id='pagedownload_image'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>download_image(image_download_url, where_to_full_file_path)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Download image from url and save to local disk</p>
<div class='pre p1 fill-light mt0'>download_image(image_download_url: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>, where_to_full_file_path: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>image_download_url</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
<div class='space-bottom0'>
<div>
<span class='code bold'>where_to_full_file_path</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
<span class="hljs-keyword">var</span> img = <span class="hljs-keyword">await</span> page.$(<span class="hljs-string">'img'</span>)
<span class="hljs-keyword">var</span> img_src = <span class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function"><span class="hljs-params">img</span> =></span> img.getAttribute(<span class="hljs-string">"src"</span>), img);
page.download_image(img_src,<span class="hljs-string">"usr/test/profile.png"</span>)
})</pre>
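<p>An <code>src</code> attribute is often relative, and a fixed target path gets overwritten on every page. The sketch below shows one way to resolve the attribute against the page URL and derive a per-image file path. <code>imageTarget</code> and the <code>downloads</code> directory are hypothetical; whether <code>download_image</code> itself resolves relative URLs is not stated in this documentation.</p>

```javascript
const path = require('path');

// Hypothetical helper: turn an <img> src into an absolute URL and a local
// file path. pageURL would typically come from Puppeteer's page.url().
function imageTarget(imgSrc, pageURL, directory) {
  const absolute = new URL(imgSrc, pageURL);
  const name = path.posix.basename(absolute.pathname) || 'image';
  return { url: absolute.href, file: path.join(directory, name) };
}

const target = imageTarget('/static/profile.png?v=2', 'https://example.com/users/1', 'downloads');
console.log(target.url); // https://example.com/static/profile.png?v=2
console.log(target.file);
// Inside callbackOnPageLoad: page.download_image(target.url, target.file)
```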
</section>
</div>
</div>
<div class='border-bottom' id='pagesaveresult'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>saveResult(text)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Saves the text result. The saved results are passed as input to the callbackOnFinish function.<br>
Each URL can store one result.</p>
<div class='pre p1 fill-light mt0'>saveResult(text: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>text</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
<span class="hljs-keyword">var</span> article = <span class="hljs-keyword">await</span> page.$<span class="hljs-built_in">eval</span>(<span class="hljs-string">'article'</span>, tag => tag.innerText);
page.saveResult(article)
})</pre>
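<p>Because each URL can store only one text result, several scraped fields have to be combined into a single string. A minimal sketch using JSON follows; the helper names and the field names <code>title</code> and <code>body</code> are illustrative assumptions, not part of the library.</p>

```javascript
// Sketch: pack several scraped fields into the single text result allowed per URL.
// The helper names and field names (title, body) are illustrative assumptions.
function packResult(fields) {
  return JSON.stringify(fields);
}

function unpackResult(text) {
  return JSON.parse(text);
}

const text = packResult({ title: 'Hello', body: 'First paragraph' });
console.log(text); // {"title":"Hello","body":"First paragraph"}
console.log(unpackResult(text).title); // Hello
// Inside callbackOnPageLoad: page.saveResult(text)
```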
</section>
</div>
</div>
<div class='border-bottom' id='pagewrite_text_to_file'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>write_text_to_file(content, filename)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Write text content to local file</p>
<div class='pre p1 fill-light mt0'>write_text_to_file(content: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>, filename: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>content</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
<div class='space-bottom0'>
<div>
<span class='code bold'>filename</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
<span class="hljs-keyword">var</span> article = <span class="hljs-keyword">await</span> page.$<span class="hljs-built_in">eval</span>(<span class="hljs-string">'article'</span>, tag => tag.innerText);
page.write_text_to_file(article,<span class="hljs-string">"usr/test/article.txt"</span>)
});</pre>
</section>
</div>
</div>
<div class='border-bottom' id='pageadd_url_to_queue'>
<div class="clearfix small pointer toggle-sibling">
<div class="py1 contain">
<a class='icon pin-right py1 dark-link caret-right'>▸</a>
<span class='code strong strong truncate'>add_url_to_queue(url)</span>
</div>
</div>
<div class="clearfix display-none toggle-target">
<section class='p2 mb2 clearfix bg-white minishadow'>
<p>Adds the URL to the processing queue.</p>
<div class='pre p1 fill-light mt0'>add_url_to_queue(url: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div>
<div class='py1 quiet mt1 prose-big'>Parameters</div>
<div class='prose'>
<div class='space-bottom0'>
<div>
<span class='code bold'>url</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code>
</div>
</div>
</div>
<div class='py1 quiet mt1 prose-big'>Example</div>
<pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
<span class="hljs-keyword">var</span> a = <span class="hljs-keyword">await</span> page.$(<span class="hljs-string">'a'</span>)
<span class="hljs-keyword">var</span> url = <span class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function"><span class="hljs-params">a</span> =></span> a.getAttribute(<span class="hljs-string">"href"</span>), a);
page.add_url_to_queue(url)
});</pre>
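<p>An <code>href</code> read with <code>getAttribute</code> may be relative, empty, or use a scheme such as <code>mailto:</code>. The sketch below shows one way to clean such a value before queueing it. <code>toQueueableURL</code> is a hypothetical helper; whether <code>add_url_to_queue</code> performs any of this itself is not stated in this documentation.</p>

```javascript
// Hypothetical helper: resolve an href against the page URL and keep only
// http(s) links. Returns null for links that should not be queued.
function toQueueableURL(href, pageURL) {
  if (!href) return null;
  let resolved;
  try {
    resolved = new URL(href, pageURL);
  } catch (err) {
    return null;
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
  resolved.hash = ''; // drop the fragment so #top and #bottom map to one page
  return resolved.href;
}

console.log(toQueueableURL('../about#team', 'https://example.com/docs/intro')); // https://example.com/about
console.log(toQueueableURL('mailto:me@example.com', 'https://example.com/'));   // null
// Inside callbackOnPageLoad, with page.url() as the base:
//   const next = toQueueableURL(url, page.url()); if (next) page.add_url_to_queue(next)
```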
</section>
</div>
</div>
</div>
</section>
</div>
</div>
<script src='assets/anchor.js'></script>
<script src='assets/split.js'></script>
<script src='assets/site.js'></script>
</body>
</html>