# @letsscrapedata/scraper

Web scraper that scrapes web pages using LetsScrapeData XML templates.

<div align="center"> <div> <a href="https://www.LetsScrapeData.com" style="text-decoration: none" target="_blank"> <img src="https://www.letsscrapedata.com/assets/logo.svg" width="160" alt="LetsScrapeData"> </a> </div> <!-- <div>This is part of LetsScrapeData <a href="https://www.npmjs.com/~letsscrapedata"> web scraping suites </a>.</div> --> <div>You can use the free <a href="https://www.LetsScrapeData.com">LetsScrapeData App</a> if you want to scrape web data without programming.</div> <br/> </div>

Please ask for help and discuss how to scrape a website on the [Discord server](https://discord.gg/46atZ8kPVb), which responds quickly. Issues are better submitted on [GitHub](https://github.com/LetsScrapeData/scraper) for easier tracking.

## Features

1. Template-driven web scraping
   - You can quickly [design templates](https://doc.letsscrapedata.com/template/) for scraping different websites.
   - Templates are intuitive and easy to maintain.
2. Browser operations supported by the [controller](https://www.npmjs.com/package/@letsscrapedata/controller) package
   - Same interface for playwright, patchright, camoufox, puppeteer, and cheerio: easy to switch between them
   - Web browsing automation: goto (open) / click / input / hover / select / scroll
   - Automatic captcha solving: reCAPTCHA (v2 & v3), Cloudflare Turnstile, GeeTest (v3 & v4), image/text, coordinate
   - State data management: cookies, localStorage, HTTP headers, custom session data
   - Element selection by CSS selectors or XPath, inside frames or not
   - Automatic file saving: screenshot, pdf, mhtml, files downloaded directly or by clicking
3. API requests
   - Browser and API can be used at the same time, sharing cookies and headers.
   - HTTP headers: intercepted, generated automatically or by browser automation, obtained by API, etc.
4. Fingerprint management
   - Automatically generates fingerprints of the latest common browsers
5. Simple rate limits: automatic flow control, such as interval / max concurrency / times per period
6. Simple proxy management: multiple "static" proxies to increase concurrency
7. Subtasks: complex tasks can be split into multiple simple subtasks for better maintainability and increased concurrency
8. Data export

## Install

```sh
npm install @letsscrapedata/scraper
```

## Examples

1. Example with the default ScraperConfig:

   ```javascript
   // javascript
   import { scraper } from "@letsscrapedata/scraper";

   /**
    * tid: ID of the template to execute, such as the template for scraping one list of the example
    *      in page "https://www.letsscrapedata.com/pages/listexample1.html"
    * parasstrs: input parameters of the tasks, such as "1"
    * This example executes five tasks using template 10001; each scrapes the data in one page.
    */
   const newTasks = [{ tid: 10001, parasstrs: ["1", "2", "3", "4", "5"] }];
   /* The following line does the same thing using subtasks, scraping the data in the first five pages */
   // const newTasks = [{ tid: 10002, parasstrs: ["5"] }];
   await scraper(newTasks);
   ```
2. Example with a ScraperConfig:

   ```typescript
   // typescript
   import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

   const scraperConfig: ScraperConfig = {
     browserConfigs: [
       /* launch a chromium browser using puppeteer, no proxy */
       { browserControllerType: "puppeteer", proxyUrl: "" },
       /* launch a chromium browser using playwright, with a proxy */
       { browserControllerType: "playwright", proxyUrl: "http://proxyId:port" },
       /* connect to the current browser using patchright */
       { browserUrl: "http://localhost:9222/" },
     ],
     // exitWhenCompleted: true,
     // lsdLaunchOptions: { headless: true },
     // loadUnfinishedTasks: true,
     // loadFailedTasksInterval: 5
     // captcha: { clientKey: "xxx" } // to solve captchas using 2captcha
   };
   const newTasks: TemplateTasks[] = [{ tid: 10002, parasstrs: ["9"] }];
   await scraper(newTasks, scraperConfig);
   ```

## ScraperConfig

Common configurations:

- Proxies and browsers: browserConfigs; by default a browser is launched using browserControllerType/browserType, without a proxy
- Launch options of the browser: lsdLaunchOptions, default `{ headless: false }`
- Whether to load unfinished tasks: loadUnfinishedTasks, default false
- Whether to exit when completed: exitWhenCompleted, default false
- File format of scraped data: dataFileFormat, default "jsonl"
- API key of the captcha solver: captcha.clientKey

Complete configurations:

```typescript
export interface ScraperConfig {
  /**
   * @default false
   */
  exitWhenCompleted?: boolean;

  /**
   * whether to use the parasstr in XML if the parasstr of a task is ""
   * @default false
   */
  useParasstrInXmlIfNeeded?: boolean;

  /**
   * whether to load unfinished tasks
   * @default false
   */
  loadUnfinishedTasks?: boolean;

  //////////////////////////////////////////////////////////////////////////// directory

  /**
   * @default "", which will use the current directory of the process + "/data/"
   * If not empty, baseDir must be an absolute path, and the directory must exist and have read and write permissions.
   */
  baseDir?: string;

  /**
   * filenames in action_setvar_get/get_file must include inputFileDirPart for security.
   * @default "LetsScrapeData"
   */
  inputFileDirPart?: string;

  //////////////////////////////////////////////////////////////////////////// browser

  /**
   * whether to use puppeteer-extra-plugin-stealth; use patchright instead
   * @default false
   */
  useStealthPlugin?: boolean;

  /**
   * default browserControllerType of BrowserConfig
   * @default "patchright"
   */
  browserControllerType?: BrowserControllerType;

  /**
   * default browserType of BrowserConfig
   * @default "chromium"
   */
  browserType?: LsdBrowserType;

  /**
   * @default { headless: false, geoip: true }
   */
  lsdLaunchOptions?: LsdLaunchOptions;

  /**
   * @default { browserUrl: "" }
   */
  lsdConnectOptions?: LsdConnectOptions;

  /**
   * Important: browsers to be launched or connected, each using its proxyUrl
   * @default [{ proxyUrl: "" }], launch a default browser using the default type of browser controller, no proxy
   */
  browserConfigs?: BrowserConfig[];

  //////////////////////////////////////////////////////////////////////////// captcha

  captcha?: {
    /**
     * clientKey of 2captcha
     */
    clientKey: string;
    // if you need to solve captchas in camoufox, please contact the administrator
  },

  //////////////////////////////////////////////////////////////////////////// template

  /**
   * the default maximum number of concurrent tasks that can execute the same template in a browserContext
   * @default 1
   */
  maxConcurrency?: number;

  /**
   * @default ""
   */
  readCode?: string;

  /**
   * @default []
   */
  templateParas?: TemplatePara[];

  //////////////////////////////////////////////////////////////////////////// scheduler

  /**
   * @default 10
   */
  totalMaxConcurrency?: number;

  /**
   * minimum milliseconds between two tasks of the same template
   * @default 2000
   */
  minMiliseconds?: number,

  //////////////////////////////////////////////////////////////////////////// data

  /**
   * whether to move all dat_* files into a new directory "yyyyMMddHHmmss"
   * @default false
   */
  moveDataWhenStart?: boolean;

  /**
   * DataFileFormat
   * = "csv" | "jsonl" | "tsv" | "txt"
   * @default "jsonl"
   */
  dataFileFormat?: DataFileFormat;

  /**
   * valid only when dataFileFormat is "txt"
   */
  columnSeperator?: string;
}

/**
 * Only one of browserUrl and proxyUrl will take effect, and browserUrl has higher priority.
 */
export interface BrowserConfig {
  browserControllerType?: BrowserControllerType;

  /**
   * url used to connect to the current browser
   * - url starts with "http://", such as "http://localhost:9222/"
   * - browserUrl can be used after a manual login in advance.
   */
  browserUrl?: string;

  /**
   * proxy
   * - no proxy will be used if proxyUrl is ""
   * - valid only if !browserUrl
   */
  proxyUrl?: string;

  /**
   * type of browser to be launched
   * valid only if !browserUrl
   * @default "chromium"
   */
  browserType?: LsdBrowserType;
}
```
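
To show how the scheduler fields (totalMaxConcurrency, maxConcurrency, minMiliseconds) combine with browserConfigs and the data options, here is a minimal configuration sketch. The proxy URL is a placeholder, not a real endpoint, and the template ID 10001 is taken from the examples above; field names follow the interface as declared, including the `minMiliseconds` spelling.

```typescript
import { scraper, TemplateTasks, ScraperConfig } from "@letsscrapedata/scraper";

const scraperConfig: ScraperConfig = {
  // One browser per entry; per the BrowserConfig docs, browserUrl takes
  // priority over proxyUrl, so these two entries each launch a browser.
  browserConfigs: [
    { proxyUrl: "" },                           // direct connection, default controller
    { proxyUrl: "http://proxy1.example:8080" }, // placeholder "static" proxy
  ],
  // Scheduler limits: at most 10 tasks overall, at most 2 concurrent tasks of
  // the same template per browserContext, and at least 3000 ms between two
  // tasks of the same template.
  totalMaxConcurrency: 10,
  maxConcurrency: 2,
  minMiliseconds: 3000,
  // Write scraped data as CSV instead of the default "jsonl".
  dataFileFormat: "csv",
  exitWhenCompleted: true,
};

const newTasks: TemplateTasks[] = [{ tid: 10001, parasstrs: ["1", "2"] }];
await scraper(newTasks, scraperConfig);
```

Splitting work across two browsers (one direct, one proxied) is the package's suggested way to raise concurrency while the per-template limits keep request rates polite.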