<!doctype html> <html> <head> <meta charset='utf-8' /> <title>easy_web_crawler 1.0.5 | Documentation</title> <meta name='viewport' content='width=device-width,initial-scale=1'> <link href='assets/bass.css' type='text/css' rel='stylesheet' /> <link href='assets/style.css' type='text/css' rel='stylesheet' /> <link href='assets/github.css' type='text/css' rel='stylesheet' /> <link href='assets/split.css' type='text/css' rel='stylesheet' /> <meta name='description' content='Web crawler wrapper around the puppeteer module that simplifies crawling AJAX/JavaScript-enabled pages.'> </head> <body class='documentation m0'> <div class='flex'> <div id='split-left' class='overflow-auto fs0 height-viewport-100'> <div class='py1 px2'> <h3 class='mb0 no-anchor'>easy_web_crawler</h3> <div class='mb1'><code>1.0.5</code></div> <input placeholder='Filter' id='filter-input' class='col12 block input' type='text' /> <div id='toc'> <ul class='list-reset h5 py1-ul'> <li><a href='#scaper' class=" toggle-sibling"> Scaper <span class='icon'></span> </a> <div class='toggle-target display-none'> <ul class='list-reset py1-ul pl1'> <li class='h5'><span>Instance members</span></li> <li><a href='#scaperstartwithurls' class='regular pre-open'> #startWithURLs </a></li> <li><a href='#scaperallowifmatches' class='regular pre-open'> #allowIfMatches </a></li> <li><a href='#scapersaveprogressinfile' class='regular pre-open'> #saveProgressInFile </a></li> <li><a href='#scaperenableautocrawler' class='regular pre-open'> #enableAutoCrawler </a></li> <li><a href='#scaperwaitbetweenpageload' class='regular pre-open'> #waitBetweenPageLoad </a></li> <li><a href='#scapercallbackonfinish' class='regular pre-open'> #callbackOnFinish </a></li> <li><a href='#scapercallbackonpageload' class='regular pre-open'> #callbackOnPageLoad </a></li> <li><a href='#scaperstart' class='regular pre-open'> #start </a></li> </ul> </div> </li> <li><a href='#page' class=" toggle-sibling"> Page <span class='icon'></span> </a> <div 
class='toggle-target display-none'> <ul class='list-reset py1-ul pl1'> <li class='h5'><span>Instance members</span></li> <li><a href='#pagedownload_image' class='regular pre-open'> #download_image </a></li> <li><a href='#pagesaveresult' class='regular pre-open'> #saveResult </a></li> <li><a href='#pagewrite_text_to_file' class='regular pre-open'> #write_text_to_file </a></li> <li><a href='#pageadd_url_to_queue' class='regular pre-open'> #add_url_to_queue </a></li> </ul> </div> </li> </ul> </div> <div class='mt1 h6 quiet'> <a href='http://documentation.js.org/reading-documentation.html'>Need help reading this?</a> </div> </div> </div> <div id='split-right' class='relative overflow-auto height-viewport-100'> <section class='p2 mb2 clearfix bg-white minishadow'> <div class='clearfix'> <h3 class='fl m0' id='scaper'> Scaper </h3> </div> <p>Main scraper class.</p> <div class='pre p1 fill-light mt0'>new Scaper()</div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// npm install easy_web_crawler</span>
<span class="hljs-keyword">const</span> Scaper = <span class="hljs-built_in">require</span>(<span class="hljs-string">'easy_web_crawler'</span>)
<span class="hljs-keyword">var</span> scraper = <span class="hljs-keyword">new</span> Scaper();</pre> <div class='py1 quiet mt1 prose-big'>Instance Members</div> <div class="clearfix"> <div class='border-bottom' id='scaperstartwithurls'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>startWithURLs(listOfURLs)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>This call is mandatory.<br> Takes a single URL, or a list of URLs, to use as the starting point of the crawl.</p> <div class='pre p1 fill-light mt0'>startWithURLs(listOfURLs: (<a 
href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a> | <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Array">Array</a>&#x3C;<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>>))</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>listOfURLs</span> <code class='quiet'>((<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a> | <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Array">Array</a>&#x3C;<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>>))</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// set the starting URLs, either as a list or as a single URL</span>
scraper.startWithURLs([<span class="hljs-string">'www.google.com'</span>,<span class="hljs-string">'www.bing.com'</span>])
scraper.startWithURLs(<span class="hljs-string">'www.google.com'</span>)</pre> </section> </div> </div> <div class='border-bottom' id='scaperallowifmatches'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>allowIfMatches(nonAsyncFunction)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Takes a non-async callback function as its argument; a URL is added to the processing queue only if the function returns a true value.<br> This is optional. 
By default, every URL added to the processing queue is accepted.<br></p> <div class='pre p1 fill-light mt0'>allowIfMatches(nonAsyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>nonAsyncFunction</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// accept only URLs that contain www.google.com</span>
scraper.allowIfMatches(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">url</span>) </span>{
  <span class="hljs-keyword">return</span> url.indexOf(<span class="hljs-string">'www.google.com'</span>)&gt;<span class="hljs-number">-1</span>
})</pre> </section> </div> </div> <div class='border-bottom' id='scapersaveprogressinfile'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>saveProgressInFile(filePath)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>This setting is optional.<br> Saves your progress to the given file, so you can stop the scraper and later restart it from the previous state.<br> The file is an SQLite database file; you can modify its contents using SQLite clients.<br> If no file is specified, the state is stored in memory.</p> <div class='pre p1 fill-light mt0'>saveProgressInFile(filePath: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span 
class='code bold'>filePath</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// state is stored in the state.db file</span>
scraper.saveProgressInFile(<span class="hljs-string">"./state.db"</span>)</pre> </section> </div> </div> <div class='border-bottom' id='scaperenableautocrawler'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>enableAutoCrawler(flag, enableAutoCrawler)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Lets the scraper automatically collect all the links from each page and add them to the processing queue.<br> Note: URLs are still filtered out if the allowIfMatches function returns false.</p> <div class='pre p1 fill-light mt0'>enableAutoCrawler(flag: any, enableAutoCrawler: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Boolean">boolean</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>flag</span> <code class='quiet'>(any)</code> </div> </div> <div class='space-bottom0'> <div> <span class='code bold'>enableAutoCrawler</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Boolean">boolean</a>)</code> true to enable </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.enableAutoCrawler(<span class="hljs-literal">true</span>)</pre> </section> </div> </div> <div class='border-bottom' id='scaperwaitbetweenpageload'> <div class="clearfix small pointer toggle-sibling"> 
<div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>waitBetweenPageLoad(delayInMilliSeconds)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Delay between page loads, in milliseconds.</p> <div class='pre p1 fill-light mt0'>waitBetweenPageLoad(delayInMilliSeconds: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Number">number</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>delayInMilliSeconds</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Number">number</a> = <code>0</code>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'><span class="hljs-comment">// wait 90 milliseconds between page loads</span>
scraper.waitBetweenPageLoad(<span class="hljs-number">90</span>)</pre> </section> </div> </div> <div class='border-bottom' id='scapercallbackonfinish'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>callbackOnFinish(asyncFunction)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Final callback, called when scraping is completed.</p> <div class='pre p1 fill-light mt0'>callbackOnFinish(asyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>asyncFunction</span> <code class='quiet'>(<a 
href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnFinish(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">result</span>)</span>{
  <span class="hljs-built_in">console</span>.log(result)
})</pre> </section> </div> </div> <div class='border-bottom' id='scapercallbackonpageload'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>callbackOnPageLoad(asyncFunction)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>This is the main function; define your scraping logic in it.<br> It is called for each page in the processing queue.<br> It is called with the puppeteer page object as input.<br> The page object has additional methods to support scraping.</p> <div class='pre p1 fill-light mt0'>callbackOnPageLoad(asyncFunction: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>asyncFunction</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Statements/function">function</a>)</code> an async function with a single input argument, page. 
</div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
  page.saveResult(<span class="hljs-keyword">await</span> page.title())
})</pre> </section> </div> </div> <div class='border-bottom' id='scaperstart'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>start()</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Starts the scraping process. The callbackOnFinish function is called once scraping is completed.</p> <div class='pre p1 fill-light mt0'>start()</div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.start()</pre> </section> </div> </div> </div> </section> <section class='p2 mb2 clearfix bg-white minishadow'> <div class='clearfix'> <h3 class='fl m0' id='page'> Page </h3> </div> <p>Puppeteer page class. 
Enhanced with the supporting functions detailed below.</p> <div class='pre p1 fill-light mt0'>new Page()</div> <div class='py1 quiet mt1 prose-big'>Instance Members</div> <div class="clearfix"> <div class='border-bottom' id='pagedownload_image'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>download_image(image_download_url, where_to_full_file_path)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Downloads an image from a URL and saves it to local disk.</p> <div class='pre p1 fill-light mt0'>download_image(image_download_url: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>, where_to_full_file_path: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>image_download_url</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> <div class='space-bottom0'> <div> <span class='code bold'>where_to_full_file_path</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
  <span class="hljs-keyword">var</span> img = <span class="hljs-keyword">await</span> page.$(<span class="hljs-string">'img'</span>);
  <span class="hljs-keyword">var</span> img_src = <span 
class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function"><span class="hljs-params">img</span> =&gt;</span> img.getAttribute(<span class="hljs-string">"src"</span>), img);
  page.download_image(img_src,<span class="hljs-string">"usr/test/profile.png"</span>)
})</pre> </section> </div> </div> <div class='border-bottom' id='pagesaveresult'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>saveResult(text)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Saves the text result; it is returned as input to the callbackOnFinish function.<br> Each URL can store one result.</p> <div class='pre p1 fill-light mt0'>saveResult(text: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>text</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
  <span class="hljs-keyword">var</span> article = <span class="hljs-keyword">await</span> page.$<span class="hljs-built_in">eval</span>(<span class="hljs-string">'article'</span>, tag =&gt; tag.innerText);
  page.saveResult(article)
})</pre> </section> </div> </div> <div class='border-bottom' id='pagewrite_text_to_file'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span 
class='code strong strong truncate'>write_text_to_file(content, filename)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 clearfix bg-white minishadow'> <p>Writes text content to a local file.</p> <div class='pre p1 fill-light mt0'>write_text_to_file(content: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>, filename: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>content</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> <div class='space-bottom0'> <div> <span class='code bold'>filename</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
  <span class="hljs-keyword">var</span> article = <span class="hljs-keyword">await</span> page.$<span class="hljs-built_in">eval</span>(<span class="hljs-string">'article'</span>, tag =&gt; tag.innerText);
  page.write_text_to_file(article,<span class="hljs-string">"usr/test/article.txt"</span>)
});</pre> </section> </div> </div> <div class='border-bottom' id='pageadd_url_to_queue'> <div class="clearfix small pointer toggle-sibling"> <div class="py1 contain"> <a class='icon pin-right py1 dark-link caret-right'></a> <span class='code strong strong truncate'>add_url_to_queue(url)</span> </div> </div> <div class="clearfix display-none toggle-target"> <section class='p2 mb2 
clearfix bg-white minishadow'> <p>Adds the URL to the processing queue.</p> <div class='pre p1 fill-light mt0'>add_url_to_queue(url: <a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</div> <div class='py1 quiet mt1 prose-big'>Parameters</div> <div class='prose'> <div class='space-bottom0'> <div> <span class='code bold'>url</span> <code class='quiet'>(<a href="https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String">string</a>)</code> </div> </div> </div> <div class='py1 quiet mt1 prose-big'>Example</div> <pre class='p1 overflow-auto round fill-light'>scraper.callbackOnPageLoad(<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">page</span>)</span>{
  <span class="hljs-keyword">var</span> a = <span class="hljs-keyword">await</span> page.$(<span class="hljs-string">'a'</span>);
  <span class="hljs-keyword">var</span> url = <span class="hljs-keyword">await</span> page.evaluate(<span class="hljs-function"><span class="hljs-params">a</span> =&gt;</span> a.getAttribute(<span class="hljs-string">"href"</span>), a);
  page.add_url_to_queue(url)
});</pre> </section> </div> </div> </div> </section> </div> </div> <script src='assets/anchor.js'></script> <script src='assets/split.js'></script> <script src='assets/site.js'></script> </body> </html>