scrapegraph-js
Version:
Scrape and extract structured data from a webpage using ScrapeGraphAI's APIs. Supports cookies for authentication, infinite scrolling, and pagination.
439 lines (364 loc) • 12.9 kB
Markdown
# 🤖 Agentic Scraper
The Agentic Scraper enables AI-powered browser automation for complex interactions like form filling, clicking buttons, and navigating multi-step workflows.
## 🚀 Quick Start
### Basic Usage (No AI Extraction)
```javascript
import { agenticScraper, getAgenticScraperRequest } from 'scrapegraph-js';
const apiKey = 'your-api-key';
const url = 'https://dashboard.scrapegraphai.com/';
const steps = [
'Type email@gmail.com in email input box',
'Type test-password@123 in password inputbox',
'click on login'
];
// Submit automation request (basic scraping)
const response = await agenticScraper(apiKey, url, steps, true);
console.log('Request ID:', response.request_id);
// Check results
const result = await getAgenticScraperRequest(apiKey, response.request_id);
console.log('Status:', result.status);
console.log('Markdown Content:', result.markdown);
```
### AI Extraction Usage
```javascript
import { agenticScraper, getAgenticScraperRequest } from 'scrapegraph-js';
const apiKey = 'your-api-key';
const url = 'https://dashboard.scrapegraphai.com/';
const steps = [
'Type email@gmail.com in email input box',
'Type test-password@123 in password inputbox',
'click on login',
'wait for dashboard to load'
];
// Define extraction schema
const outputSchema = {
user_info: {
type: "object",
properties: {
username: { type: "string" },
email: { type: "string" },
dashboard_sections: { type: "array", items: { type: "string" } }
}
}
};
// Submit automation request with AI extraction
const response = await agenticScraper(
apiKey,
url,
steps,
true, // useSession
"Extract user information and available dashboard sections", // userPrompt
outputSchema, // outputSchema
true // aiExtraction
);
console.log('Request ID:', response.request_id);
// Check results
const result = await getAgenticScraperRequest(apiKey, response.request_id);
if (result.status === 'completed') {
console.log('Extracted Data:', result.result);
console.log('Raw Markdown:', result.markdown);
}
```
## 📚 API Reference
### `agenticScraper(apiKey, url, steps, useSession, userPrompt, outputSchema, aiExtraction)`
Performs automated browser actions on a webpage with optional AI extraction.
**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `url` (string): The URL of the webpage to interact with
- `steps` (string[]): Array of automation steps to perform
- `useSession` (boolean, optional): Whether to use session management (default: true)
- `userPrompt` (string, optional): Prompt for AI extraction (required when aiExtraction=true)
- `outputSchema` (object, optional): Schema for structured data extraction (used with aiExtraction=true)
- `aiExtraction` (boolean, optional): Whether to use AI for data extraction (default: false)
**Returns:** Promise<Object> with `request_id` and initial `status`
**Example Steps:**
```javascript
const steps = [
'click on search bar',
'type "laptop" in search input',
'press Enter key',
'wait for 2 seconds',
'click on first result',
'scroll down to reviews'
];
```
### `getAgenticScraperRequest(apiKey, requestId)`
Retrieves the status or result of an agentic scraper request.
**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `requestId` (string): The request ID from a previous agentic scraper call
**Returns:** Promise<Object> with:
- `status`: 'pending', 'completed', or 'failed'
- `result`: Automation results (when completed)
- `error`: Error message (when failed)
- `created_at`: Request creation timestamp
- `completed_at`: Completion timestamp (when completed)
## 🎯 Use Cases
### 1. **Basic Automation (No AI)**
Perfect for simple automation tasks where you just need the raw HTML/markdown content:
- **Login automation**: Automate login flows and capture the resulting page
- **Form submission**: Fill out forms and get confirmation pages
- **Navigation**: Navigate through multi-step workflows
- **Content scraping**: Get page content after performing actions
### 2. **AI-Powered Data Extraction**
Ideal when you need structured data from the automated interactions:
- **Dashboard data extraction**: Login and extract user information, metrics, settings
- **E-commerce scraping**: Search products and extract structured product data
- **Form result parsing**: Submit forms and extract confirmation details, reference numbers
- **Content analysis**: Navigate to content and extract key information in structured format
### 3. **Hybrid Approach**
Use both modes depending on your needs:
- **Development/Testing**: Start with basic mode to test automation steps
- **Production**: Add AI extraction for structured data processing
- **Fallback**: Use basic mode when AI extraction isn't needed
## 💡 AI Extraction Examples
### E-commerce Product Search
```javascript
const steps = [
'click on search box',
'type "wireless headphones" in search',
'press enter',
'wait for results to load',
'scroll down 2 times'
];
const schema = {
products: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "string" },
rating: { type: "number" },
availability: { type: "string" }
}
}
}
};
const response = await agenticScraper(
apiKey,
'https://example-store.com',
steps,
true,
'Extract product names, prices, ratings, and availability from search results',
schema,
true
);
```
### Contact Form with Confirmation
```javascript
const steps = [
'type "John Doe" in name field',
'type "john@example.com" in email field',
'type "Product inquiry" in subject field',
'type "I need more information about pricing" in message field',
'click submit button',
'wait for confirmation'
];
const schema = {
submission: {
type: "object",
properties: {
status: { type: "string" },
message: { type: "string" },
reference_number: { type: "string" },
response_time: { type: "string" }
}
}
};
const response = await agenticScraper(
apiKey,
'https://company.com/contact',
steps,
true,
'Extract form submission status, confirmation message, and any reference numbers',
schema,
true
);
```
### Social Media Data Extraction
```javascript
const steps = [
'type "username" in username field',
'type "password" in password field',
'click login button',
'wait for dashboard',
'click on profile section'
];
const schema = {
profile: {
type: "object",
properties: {
username: { type: "string" },
followers: { type: "number" },
following: { type: "number" },
posts: { type: "number" },
recent_activity: { type: "array", items: { type: "string" } }
}
}
};
const response = await agenticScraper(
apiKey,
'https://social-platform.com/login',
steps,
true,
'Extract profile information including username, follower counts, and recent activity',
schema,
true
);
```
## 🔧 Best Practices
### When to Use AI Extraction
- ✅ **Use AI extraction when**: You need structured data, specific information extraction, or data validation
- ❌ **Skip AI extraction when**: You just need raw content, testing automation steps, or processing content externally
### Schema Design Tips
- **Be specific**: Define exact data types and required fields
- **Use descriptions**: Add description fields to guide AI extraction
- **Nested objects**: Use nested schemas for complex data structures
- **Arrays**: Use arrays for lists of similar items (products, comments, etc.)
### Step Optimization
- **Wait steps**: Add wait steps after actions that trigger loading
- **Specific selectors**: Use specific element descriptions ("click on blue submit button")
- **Sequential actions**: Break complex actions into smaller, specific steps
- **Error handling**: Include steps to handle common UI variations
### 🔐 Login Automation
```javascript
const loginSteps = [
'click on email input',
'type "user@example.com" in email field',
'click on password input',
'type "password123" in password field',
'click login button',
'wait for dashboard to load'
];
const response = await agenticScraper(apiKey, 'https://app.example.com/login', loginSteps, true);
```
### 🛒 E-commerce Interaction
```javascript
const shoppingSteps = [
'click on search bar',
'type "wireless headphones" in search',
'press Enter',
'wait for results to load',
'click on first product',
'click add to cart button',
'click view cart'
];
const response = await agenticScraper(apiKey, 'https://shop.example.com', shoppingSteps, true);
```
### 📝 Form Submission
```javascript
const formSteps = [
'click on name input',
'type "John Doe" in name field',
'click on email input',
'type "john@example.com" in email field',
'click on message textarea',
'type "Hello, this is a test message" in message area',
'click submit button'
];
const response = await agenticScraper(apiKey, 'https://example.com/contact', formSteps, false);
```
## ⚡ Advanced Usage
### Polling for Results
```javascript
async function waitForCompletion(requestId, timeoutSeconds = 120) {
const startTime = Date.now();
const timeout = timeoutSeconds * 1000;
while (Date.now() - startTime < timeout) {
const status = await getAgenticScraperRequest(apiKey, requestId);
if (status.status === 'completed') {
return status.result;
} else if (status.status === 'failed') {
throw new Error(status.error);
}
await new Promise(resolve => setTimeout(resolve, 5000)); // Wait 5 seconds
}
throw new Error('Timeout waiting for completion');
}
```
### Error Handling
```javascript
try {
const response = await agenticScraper(apiKey, url, steps, true);
const result = await waitForCompletion(response.request_id);
console.log('Automation successful:', result);
} catch (error) {
if (error.message.includes('validation')) {
console.log('Input validation failed:', error.message);
} else if (error.message.includes('timeout')) {
console.log('Automation timed out');
} else {
console.log('Automation failed:', error.message);
}
}
```
## 📝 Step Syntax
Steps should be written in natural language describing the action to perform:
### Clicking Elements
- `"click on login button"`
- `"click on search icon"`
- `"click on first result"`
### Typing Text
- `"type 'username' in email field"`
- `"type 'password123' in password input"`
- `"type 'search query' in search box"`
### Keyboard Actions
- `"press Enter key"`
- `"press Tab key"`
- `"press Escape key"`
### Waiting
- `"wait for 2 seconds"`
- `"wait for page to load"`
- `"wait for results to appear"`
### Scrolling
- `"scroll down"`
- `"scroll to bottom"`
- `"scroll to top"`
## 🔧 Best Practices
1. **Use Session Management**: Set `useSession: true` for multi-step workflows
2. **Add Wait Steps**: Include wait times between actions for reliability
3. **Be Specific**: Use descriptive selectors like "login button" vs "button"
4. **Handle Timeouts**: Implement proper timeout handling for long operations
5. **Validate Inputs**: Check URLs and steps before making requests
## 🚨 Common Errors
### Input Validation Errors
```javascript
// ❌ Invalid URL
await agenticScraper(apiKey, 'not-a-url', steps);
// ❌ Empty steps
await agenticScraper(apiKey, url, []);
// ❌ Invalid step
await agenticScraper(apiKey, url, ['click button', '']); // Empty step
```
### Runtime Errors
- **Element not found**: Make steps more specific or add wait times
- **Timeout**: Increase polling timeout or break down complex steps
- **Session expired**: Use session management for multi-step flows
## 🌐 cURL Equivalent
```bash
curl --location 'https://api.scrapegraphai.com/v1/agentic-scrapper' \
--header 'SGAI-APIKEY: your-api-key' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "https://dashboard.scrapegraphai.com/",
"use_session": true,
"steps": [
"Type email@gmail.com in email input box",
"Type test-password@123 in password inputbox",
"click on login"
]
}'
```
## 📖 Examples
Check out the example files in the `/examples` directory:
- `agenticScraper_example.js` - Basic usage
- `getAgenticScraperRequest_example.js` - Status checking
- `agenticScraper_complete_example.js` - Complete workflow
- `agenticScraper_advanced_example.js` - Advanced patterns with error handling
## 💡 Tips
- Start with simple steps and gradually add complexity
- Test individual steps before combining them
- Use browser developer tools to identify element selectors
- Consider mobile vs desktop layouts when writing steps
- Monitor request status regularly for long-running automations