# @weavebot/core
Generic content processing framework for web scraping and AI extraction.
## Overview
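`@weavebot/core` is a lightweight, plugin-based framework for extracting structured data from web content. It supplies the infrastructure (a processor pipeline, schema registry, web scraper, and storage interface) and contains no domain logic, so you assemble your own content processing pipelines from the schemas, plugins, and adapters you define.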
## Features
- 🔌 **Plugin Architecture** - Extend functionality without modifying core
- 🤖 **Schema-Driven AI Extraction** - Register custom schemas for any data type
- 🌐 **Generic Web Scraper** - Platform-agnostic with plugin support
- 💾 **Flexible Storage Interface** - Use any backend (Airtable, MongoDB, etc.)
- 📝 **Dynamic Schema Registry** - Register schemas at runtime
- 🔧 **Zero Implementation Details** - Pure infrastructure, no domain logic
## Installation
```bash
npm install @weavebot/core
```
## Quick Start
```typescript
import ContentProcessor, {
  createWebScraper,
  createAIExtractor,
  SchemaRegistry
} from '@weavebot/core';
import { z } from 'zod';

// Create processor instance
const processor = new ContentProcessor();

// Register your schema
const ArticleSchema = z.object({
  title: z.string(),
  author: z.string(),
  content: z.string(),
  // z.date() accepts only Date instances; consider z.coerce.date()
  // if the extracted value arrives as a date string
  publishedAt: z.date()
});
processor.registerSchema('article', ArticleSchema);

// Set up processors
const scraper = createWebScraper();
const extractor = createAIExtractor({
  provider: 'openai',
  apiKey: process.env.OPENAI_API_KEY
});

// Register extraction configuration
extractor.registerExtractor('article', {
  schema: ArticleSchema,
  systemPrompt: 'Extract article information from the content',
  userPromptTemplate: 'Extract article from: {{content}}'
});

processor.addProcessor('web-scraper', scraper);
processor.addProcessor('ai-extractor', extractor);

// Process a URL (top-level await requires an ES module context)
const result = await processor.process({
  type: 'url',
  data: 'https://example.com/article',
  schema: 'article'
});
```
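
The `userPromptTemplate` above contains a `{{content}}` placeholder. How the library fills it is not documented here, so the following is only a minimal sketch of one way such a substitution could work. `renderTemplate` is a hypothetical helper written for illustration and is not part of `@weavebot/core`.

```typescript
// Hypothetical helper, not part of @weavebot/core: replaces each {{name}}
// placeholder with the matching value and leaves unknown placeholders untouched.
function renderTemplate(template: string, values: ReadonlyMap<string, string>): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    // A replacer function is used so that "$" sequences in the value are inserted literally.
    (placeholder: string, key: string) => values.get(key) ?? placeholder
  );
}

const prompt = renderTemplate(
  'Extract article from: {{content}}',
  new Map([['content', 'Example article text']])
);
console.log(prompt);
```

Leaving unknown placeholders in place, rather than replacing them with an empty string, makes a misspelled variable name visible in the final prompt.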
## Plugin System
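Create platform-specific plugins for the web scraper. A plugin declares which URLs it handles (`canHandle`) and returns the scraping configuration for them (`getConfig`):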
```typescript
import { WebScraperPlugin } from '@weavebot/core';

class MyPlatformPlugin implements WebScraperPlugin {
  name = 'my-platform';

  canHandle(url: string): boolean {
    return url.includes('myplatform.com');
  }

  getConfig(url: string) {
    return {
      strategy: 'spa',
      waitSelectors: ['.content-loaded'],
      timeout: 10000
    };
  }
}

scraper.registerPlugin(new MyPlatformPlugin());
```
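
To illustrate how `canHandle` and `getConfig` fit together, here is a self-contained sketch of plugin dispatch. It rests on several assumptions: the `ScrapeConfig` and `ScraperPlugin` interfaces are local stand-ins modelled on the example above, `resolveConfig` and `defaultConfig` (including the `'static'` strategy name) are hypothetical, and the "first registered match wins" order is a guess rather than documented behaviour. The sketch also matches on the parsed hostname, since a substring check such as `url.includes('myplatform.com')` would also match unrelated URLs that merely mention the domain.

```typescript
// Local stand-ins for the shapes used in the example above; the real types
// exported by @weavebot/core may differ.
interface ScrapeConfig {
  strategy: string;
  waitSelectors: string[];
  timeout: number;
}

interface ScraperPlugin {
  name: string;
  canHandle(url: string): boolean;
  getConfig(url: string): ScrapeConfig;
}

// Assumed fallback for URLs that no plugin claims.
const defaultConfig: ScrapeConfig = { strategy: 'static', waitSelectors: [], timeout: 5000 };

// Assumption: plugins are tried in registration order and the first match wins.
function resolveConfig(plugins: ScraperPlugin[], url: string): ScrapeConfig {
  const plugin = plugins.find((p) => p.canHandle(url));
  return plugin ? plugin.getConfig(url) : defaultConfig;
}

const examplePlugin: ScraperPlugin = {
  name: 'my-platform',
  // Compare the parsed hostname so that 'myplatform.com' appearing in the
  // path or query string of another site does not match.
  canHandle: (url) => {
    try {
      const host = new URL(url).hostname;
      return host === 'myplatform.com' || host.endsWith('.myplatform.com');
    } catch {
      return false; // not a valid absolute URL
    }
  },
  getConfig: () => ({ strategy: 'spa', waitSelectors: ['.content-loaded'], timeout: 10000 }),
};

console.log(resolveConfig([examplePlugin], 'https://myplatform.com/post/1').strategy);
console.log(resolveConfig([examplePlugin], 'https://example.com/?ref=myplatform.com').strategy);
```

The first call resolves to the plugin's configuration and the second falls back to the default, because only the hostname is compared.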
## Storage Adapters
Implement the generic storage interface for your backend:
```typescript
import { StorageAdapter } from '@weavebot/core';

class MyStorageAdapter implements StorageAdapter {
  async initialize(config) { /* ... */ }
  async create(collection, data) { /* ... */ }
  async read(collection, id) { /* ... */ }
  async update(collection, id, data) { /* ... */ }
  async delete(collection, id) { /* ... */ }
  async query(collection, filter) { /* ... */ }
}

processor.addStorage('my-storage', new MyStorageAdapter());
```
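
As a concrete starting point, the six methods above could be backed by plain in-memory maps, which is handy for tests. This is a sketch under stated assumptions: the parameter types, return values, string ids, and equality-based `query` semantics are all guesses, because the `StorageAdapter` contract is not spelled out here. The class deliberately does not declare `implements StorageAdapter` so that it stays self-contained; adapting it to the real interface may require adjusting the signatures.

```typescript
// Assumed document shape; the real StorageAdapter types may differ.
type Doc = Record<string, unknown>;

class InMemoryStorageAdapter {
  private collections = new Map<string, Map<string, Doc>>();
  private nextId = 1;

  // Assumption: initialize resets state; a real adapter would open a connection here.
  async initialize(_config?: unknown): Promise<void> {
    this.collections.clear();
  }

  async create(collection: string, data: Doc): Promise<Doc & { id: string }> {
    const id = String(this.nextId++);
    const doc = { ...data, id };
    this.table(collection).set(id, doc);
    return doc;
  }

  async read(collection: string, id: string): Promise<Doc | undefined> {
    return this.table(collection).get(id);
  }

  // Merges `data` into the stored document; resolves to undefined if the id is unknown.
  async update(collection: string, id: string, data: Doc): Promise<Doc | undefined> {
    const existing = this.table(collection).get(id);
    if (!existing) return undefined;
    const updated = { ...existing, ...data, id };
    this.table(collection).set(id, updated);
    return updated;
  }

  // Resolves to true if a document was removed.
  async delete(collection: string, id: string): Promise<boolean> {
    return this.table(collection).delete(id);
  }

  // Returns documents whose fields strictly equal every field in `filter`.
  async query(collection: string, filter: Doc): Promise<Doc[]> {
    return Array.from(this.table(collection).values()).filter((doc) =>
      Object.keys(filter).every((key) => doc[key] === filter[key])
    );
  }

  // Looks up a collection, creating it on first use.
  private table(name: string): Map<string, Doc> {
    let table = this.collections.get(name);
    if (!table) {
      table = new Map<string, Doc>();
      this.collections.set(name, table);
    }
    return table;
  }
}

const store = new InMemoryStorageAdapter();
store
  .create('articles', { title: 'Hello', author: 'Ada' })
  .then(() => store.query('articles', { author: 'Ada' }))
  .then((matches) => console.log(matches.length));
```

An empty filter matches every document in the collection, since `every` over zero keys is true.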
## Documentation
For complete documentation, visit the [GitHub repository](https://github.com/weavebot/library).
## License
MIT