# scrapegraph-js
Scrape and extract structured data from a webpage using ScrapeGraphAI's APIs. Supports cookies for authentication, infinite scrolling, and pagination.
[![npm version](https://badge.fury.io/js/scrapegraph-js.svg)](https://badge.fury.io/js/scrapegraph-js) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Documentation](https://img.shields.io/badge/docs-scrapegraphai.com-blue)](https://docs.scrapegraphai.com)
<p align="left">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
</p>
Official JavaScript/TypeScript SDK for the ScrapeGraph AI API - Smart web scraping powered by AI.
## ✨ Features

- ✨ Smart web scraping with AI
- 🔄 Fully asynchronous design
- 🔍 Detailed error handling
- ⚡ Automatic retries and logging
- 🔐 Secure API authentication
## 📦 Installation

Install the package using npm or yarn:
```bash
# with npm
npm i scrapegraph-js

# or with yarn
yarn add scrapegraph-js
```
## 🚀 Quick Start

> **Note**: Store your API keys securely in environment variables. Use `.env` files and libraries like `dotenv` to load them into your app.

```javascript
import { smartScraper } from 'scrapegraph-js';
import 'dotenv/config';

// Initialize variables
const apiKey = process.env.SGAI_APIKEY; // Set your API key as an environment variable
const websiteUrl = 'https://example.com';
const prompt = 'What does the company do?';

(async () => {
  try {
    const response = await smartScraper(apiKey, websiteUrl, prompt);
    console.log(response.result);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
## 📖 Usage

### Scrape

Fetch the raw HTML of a webpage:

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url);
    console.log('HTML content:', response.html);
    console.log('Status:', response.status);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
#### Scrape with heavy JavaScript rendering

For pages that build their content client-side, enable `renderHeavyJs`:

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url, {
      renderHeavyJs: true
    });
    console.log('HTML content with JS rendering:', response.html);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
#### Scrape with custom headers

Pass custom headers, such as a user agent or cookies:

```javascript
import { scrape } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';

(async () => {
  try {
    const response = await scrape(apiKey, url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Cookie': 'session=123'
      }
    });
    console.log('HTML content with custom headers:', response.html);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
#### Retrieving a scrape request

Retrieve the status or result of a previous scrape request by its ID:

```javascript
import { getScrapeRequest } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const requestId = 'your-request-id';

(async () => {
  try {
    const response = await getScrapeRequest(apiKey, requestId);
    console.log('Request status:', response.status);
    if (response.status === 'completed') {
      console.log('HTML content:', response.html);
    }
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
### SmartScraper

Extract structured data from any webpage using a natural-language prompt:

```javascript
import { smartScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com';
const prompt = 'Extract the main heading and description.';

(async () => {
  try {
    const response = await smartScraper(apiKey, url, prompt);
    console.log(response.result);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
#### Extraction with a custom output schema

> [!NOTE]
> This feature requires the [Zod](https://www.npmjs.com/package/zod) package for schema creation.

Here is a real-world example:
```javascript
import { smartScraper } from 'scrapegraph-js';
import { z } from 'zod';

const apiKey = 'your-api-key';
const url = 'https://scrapegraphai.com/';
const prompt = 'What does the company do?';

const schema = z.object({
  title: z.string().describe('The title of the webpage'),
  description: z.string().describe('The description of the webpage'),
  summary: z.string().describe('A brief summary of the webpage'),
});

(async () => {
  try {
    const response = await smartScraper(apiKey, url, prompt, schema);
    console.log(response.result);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
#### Infinite scrolling

For websites that load content dynamically through infinite scrolling (like social media feeds), use the `numberOfScrolls` parameter:

```javascript
import { smartScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com/infinite-scroll-page';
const prompt = 'Extract all the posts from the feed';
const numberOfScrolls = 10; // Will scroll 10 times to load more content

(async () => {
  try {
    const response = await smartScraper(apiKey, url, prompt, null, numberOfScrolls);
    console.log('Extracted data from scrolled page:', response);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
The `numberOfScrolls` parameter accepts values between 0 and 100, allowing you to control how many times the page should be scrolled before extraction.
#### Cookies for authentication

Use cookies for authentication and session management when scraping websites that require login or serve user-specific content:

```javascript
import { smartScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com/dashboard';
const prompt = 'Extract user profile information';

// Define cookies for authentication
const cookies = {
  session_id: 'abc123def456',
  auth_token: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
  user_preferences: 'dark_mode,usd'
};

(async () => {
  try {
    const response = await smartScraper(apiKey, url, prompt, null, null, null, cookies);
    console.log(response.result);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
**Common Use Cases:**
- **E-commerce sites**: User authentication, shopping cart persistence
- **Social media**: Session management, user preferences
- **Banking/Financial**: Secure authentication, transaction history
- **News sites**: User preferences, subscription content
- **API endpoints**: Authentication tokens, API keys
#### Advanced: cookies with scrolling and pagination

Combine cookies with infinite scrolling and pagination for comprehensive data extraction:

```javascript
import { smartScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const url = 'https://example.com/feed';
const prompt = 'Extract all posts from the feed';
const cookies = { session_token: 'xyz789abc123' };
const numberOfScrolls = 10; // Scroll 10 times
const totalPages = 5; // Scrape 5 pages

(async () => {
  try {
    const response = await smartScraper(apiKey, url, prompt, null, numberOfScrolls, totalPages, cookies);
    console.log('Extracted data:', response);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
### SearchScraper

Search and extract information from multiple web sources using AI:

```javascript
import { searchScraper } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const prompt = 'What is the latest version of Python and what are its main features?';

(async () => {
  try {
    const response = await searchScraper(apiKey, prompt);
    console.log(response.result);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```
### Crawl

Start a crawl job to extract structured data from a website and its linked pages, using a custom schema:

```javascript
import { crawl, getCrawlRequest } from 'scrapegraph-js';
import 'dotenv/config';

const apiKey = process.env.SGAI_APIKEY;
const url = 'https://scrapegraphai.com/';
const prompt = 'What does the company do? I also need the text content of their privacy and terms pages.';

const schema = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "ScrapeGraphAI Website Content",
  "type": "object",
  "properties": {
    "company": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "description": { "type": "string" },
        "features": { "type": "array", "items": { "type": "string" } },
        "contact_email": { "type": "string", "format": "email" },
        "social_links": {
          "type": "object",
          "properties": {
            "github": { "type": "string", "format": "uri" },
            "linkedin": { "type": "string", "format": "uri" },
            "twitter": { "type": "string", "format": "uri" }
          },
          "additionalProperties": false
        }
      },
      "required": ["name", "description"]
    },
    "services": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "service_name": { "type": "string" },
          "description": { "type": "string" },
          "features": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["service_name", "description"]
      }
    },
    "legal": {
      "type": "object",
      "properties": {
        "privacy_policy": { "type": "string" },
        "terms_of_service": { "type": "string" }
      },
      "required": ["privacy_policy", "terms_of_service"]
    }
  },
  "required": ["company", "services", "legal"]
};

(async () => {
  try {
    // Start the crawl job
    const crawlResponse = await crawl(apiKey, url, prompt, schema, {
      cacheWebsite: true,
      depth: 2,
      maxPages: 2,
      sameDomainOnly: true,
      batchSize: 1,
    });
    console.log('Crawl job started. Response:', crawlResponse);

    // If the crawl is asynchronous and returns an ID, poll for the result
    const crawlId = crawlResponse.id || crawlResponse.task_id;
    if (crawlId) {
      for (let i = 0; i < 10; i++) {
        await new Promise((resolve) => setTimeout(resolve, 5000));
        const result = await getCrawlRequest(apiKey, crawlId);
        if (result.status === 'success' && result.result) {
          console.log('Crawl completed. Result:', result.result.llm_result);
          break;
        } else if (result.status === 'failed') {
          console.log('Crawl failed. Result:', result);
          break;
        } else {
          console.log(`Status: ${result.status}, waiting...`);
        }
      }
    } else {
      console.log('No crawl ID found in response. Synchronous result:', crawlResponse);
    }
  } catch (error) {
    console.error('Error occurred:', error);
  }
})();
```
You can use a plain JSON schema or a [Zod](https://www.npmjs.com/package/zod) schema for the `schema` parameter. The crawl API supports options for crawl depth, max pages, domain restriction, and batch size.
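
For instance, the same crawl can be started with a Zod schema instead of plain JSON. This is a minimal sketch; the schema fields are illustrative, and it assumes `crawl` accepts a Zod schema directly, per the note above:

```javascript
import { crawl } from 'scrapegraph-js';
import { z } from 'zod';

const apiKey = 'your-api-key';
const url = 'https://scrapegraphai.com/';
const prompt = 'What does the company do?';

// Illustrative Zod schema; adjust the fields to your target site
const zodSchema = z.object({
  company: z.object({
    name: z.string(),
    description: z.string(),
  }),
});

(async () => {
  const crawlResponse = await crawl(apiKey, url, prompt, zodSchema, {
    depth: 2,
    maxPages: 2,
    sameDomainOnly: true,
  });
  console.log('Crawl job started:', crawlResponse);
})();
```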
### LocalScraper

Extract structured data from local HTML content:

```javascript
import { localScraper } from 'scrapegraph-js';

const apiKey = 'your_api_key';
const prompt = 'What does the company do?';

const websiteHtml = `<html>
  <body>
    <h1>Company Name</h1>
    <p>We are a technology company focused on AI solutions.</p>
    <div class="contact">
      <p>Email: contact@example.com</p>
    </div>
  </body>
</html>`;

(async () => {
  try {
    const response = await localScraper(apiKey, websiteHtml, prompt);
    console.log(response);
  } catch (error) {
    console.error(error);
  }
})();
```
### Markdownify

Convert a webpage into clean, well-structured markdown:

```javascript
import { markdownify } from 'scrapegraph-js';

const apiKey = 'your_api_key';
const url = 'https://scrapegraphai.com/';

(async () => {
  try {
    const response = await markdownify(apiKey, url);
    console.log(response);
  } catch (error) {
    console.error(error);
  }
})();
```
### Checking credits

Check your remaining API credits:

```javascript
import { getCredits } from 'scrapegraph-js';

const apiKey = 'your-api-key';

(async () => {
  try {
    const credits = await getCredits(apiKey);
    console.log('Available credits:', credits);
  } catch (error) {
    console.error('Error fetching credits:', error);
  }
})();
```
### Sending feedback

Send feedback on a completed request:

```javascript
import { sendFeedback } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const requestId = '16a63a80-c87f-4cde-b005-e6c3ecda278b';
const rating = 5;
const feedbackText = 'This is a test feedback message.';

(async () => {
  try {
    const response = await sendFeedback(apiKey, requestId, rating, feedbackText);
    console.log('Feedback response:', response);
  } catch (error) {
    console.error('Error sending feedback:', error);
  }
})();
```
## 📚 API Reference

### scrape(apiKey, url, options)

Fetches a webpage's HTML, with optional JavaScript rendering.
**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The URL of the webpage to convert
- `options` (object, optional): Configuration options
- `renderHeavyJs` (boolean, optional): Whether to render heavy JavaScript (default: false)
- `headers` (object, optional): Custom headers to send with the request
**Returns:** Promise that resolves to an object containing:
- `html`: The HTML content of the webpage
- `status`: Request status ('completed', 'processing', 'failed')
- `scrape_request_id`: Unique identifier for the request
- `error`: Error message if the request failed
**Example:**
```javascript
const response = await scrape(apiKey, 'https://example.com', {
  renderHeavyJs: true,
  headers: { 'User-Agent': 'Custom Agent' }
});
```
### getScrapeRequest(apiKey, requestId)

Retrieves the status or result of a previous scrape request.
**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `requestId` (string): The unique identifier for the scrape request
**Returns:** Promise that resolves to the request result object.
**Example:**
```javascript
const result = await getScrapeRequest(apiKey, 'request-id-here');
```
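
Because a request may still be in the `processing` state, you will often want to poll until it completes. A minimal polling sketch (the retry count, delay, and `waitForScrape` helper name are illustrative, not part of the SDK):

```javascript
import { getScrapeRequest } from 'scrapegraph-js';

// Hypothetical helper: polls until the request completes or fails
async function waitForScrape(apiKey, requestId, retries = 10, delayMs = 3000) {
  for (let i = 0; i < retries; i++) {
    const result = await getScrapeRequest(apiKey, requestId);
    if (result.status === 'completed') return result;
    if (result.status === 'failed') throw new Error(result.error);
    // Still processing; wait before the next poll
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('Timed out waiting for scrape request');
}
```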
### smartScraper(apiKey, url, prompt, schema, numberOfScrolls, totalPages, cookies)

Extracts structured data from websites using AI-powered scraping.
**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The URL of the website to scrape
- `prompt` (string): Natural language prompt describing what to extract
- `schema` (object, optional): Zod schema for structured output
- `numberOfScrolls` (number, optional): Number of scrolls for infinite scroll pages
- `totalPages` (number, optional): Number of pages to scrape
- `cookies` (object, optional): Cookies for authentication
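
A sketch of a call using the full signature (the URL, prompt, schema, and cookie values are illustrative; positional arguments follow the order listed above):

```javascript
import { smartScraper } from 'scrapegraph-js';
import { z } from 'zod';

const schema = z.object({ posts: z.array(z.string()) }); // illustrative schema

const response = await smartScraper(
  apiKey,
  'https://example.com/feed',        // url
  'Extract all posts from the feed', // prompt
  schema,                            // optional Zod schema
  5,                                 // numberOfScrolls
  2,                                 // totalPages
  { session_id: 'abc123' }           // cookies
);
```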
### Other functions

- `searchScraper(apiKey, prompt)`: searches and extracts information from multiple web sources using AI.
- `crawl(apiKey, url, prompt, schema, options)`: starts a crawl job to extract structured data from a website and its linked pages.
- `markdownify(apiKey, url)`: converts a webpage into clean, well-structured markdown format.
- Agentic scraping: performs automated actions on webpages using step-by-step instructions.
- `getCredits(apiKey)`: retrieves your current credit balance and usage statistics.
- `sendFeedback(apiKey, requestId, rating, feedbackText)`: submits feedback for a specific request.
For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com)
## 🛠️ Development

1. Clone the repository:
```bash
git clone https://github.com/ScrapeGraphAI/scrapegraph-sdk.git
cd scrapegraph-sdk/scrapegraph-js
```
2. Install dependencies:
```bash
npm install
```
3. Run linting and testing:
```bash
npm run lint
npm test
```
To generate a coverage report:

```bash
npm run test:coverage
```
## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🤝 Contributing

Contributions are welcome! Feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 🔗 Links
- [Website](https://scrapegraphai.com)
- [Documentation](https://docs.scrapegraphai.com)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)
## 💬 Support
- 📧 Email: support@scrapegraphai.com
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/new)
---
Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)