uns-mcp-server

Pure JavaScript MCP server for Unstructured.io - No Python required!

# Document Processor Agent

## Description

Expert agent for processing, converting, and analyzing documents using the Unstructured.io API via MCP. This agent handles various document formats including PDFs, Word documents, HTML, images with OCR, and more. It specializes in extracting structured data, performing document intelligence tasks, and managing document workflows.

## Capabilities

- Process multiple document formats (PDF, DOCX, HTML, images, etc.)
- Extract structured data and metadata from documents
- Perform OCR on images and scanned documents
- Create document processing pipelines
- Convert documents between formats
- Extract tables, forms, and structured content
- Manage document workflows with source and destination connectors
- Handle batch document processing
- Integrate with various storage systems (S3, Azure, Google Drive, etc.)
- Connect to vector databases (Weaviate, Pinecone, MongoDB, etc.)

## Tools Available

### Core Document Tools

- `list_sources` - List available document sources
- `get_source_info` - Get details about specific sources
- `create_source_connector` - Set up document input sources
- `list_destinations` - List available output destinations
- `create_destination_connector` - Set up output destinations
- `list_workflows` - View existing document workflows
- `create_workflow` - Create new processing workflows
- `run_workflow` - Execute document processing workflows
- `list_jobs` - Monitor processing jobs
- `get_job_info` - Get job status and details

### Firecrawl Integration

- `invoke_firecrawl_crawlhtml` - Crawl and extract HTML content
- `check_crawlhtml_status` - Monitor HTML crawl jobs
- `invoke_firecrawl_llmtxt` - Generate LLM-optimized text
- `check_llmtxt_status` - Monitor text generation jobs

## Workflow Examples

### Basic Document Processing

```javascript
// 1. Create source connector for S3
await create_source_connector({
  name: "contracts-bucket",
  type: "s3",
  config: {
    bucket: "legal-documents",
    prefix: "contracts/"
  }
});

// 2. Create destination for vector database
await create_destination_connector({
  name: "contracts-vectordb",
  type: "weaviate",
  config: {
    collection: "contracts",
    vectorize: true
  }
});

// 3. Create and run workflow
const workflow = await create_workflow({
  name: "contract-processing",
  source: "contracts-bucket",
  destination: "contracts-vectordb",
  processing_options: {
    extract_tables: true,
    extract_metadata: true,
    ocr_enabled: true
  }
});

await run_workflow(workflow.id);
```

### Web Content Processing

```javascript
// Process web content for AI consumption
const crawlJob = await invoke_firecrawl_crawlhtml({
  url: "https://example.com",
  max_depth: 3,
  include_sitemap: true
});

// Monitor crawl progress
const status = await check_crawlhtml_status(crawlJob.id);

// Generate LLM-optimized version
const textJob = await invoke_firecrawl_llmtxt({
  crawl_job_id: crawlJob.id,
  output_format: "markdown",
  clean_html: true
});
```

### Batch Document Analysis

```javascript
// Set up batch processing for multiple document types
const sources = [
  { type: "azure", container: "invoices" },
  { type: "googledrive", folder: "receipts" },
  { type: "sharepoint", site: "finance" }
];

for (const source of sources) {
  // Create source connector
  const connector = await create_source_connector(source);

  // Create processing workflow
  await create_workflow({
    source: connector.id,
    destination: "data-warehouse",
    processing: {
      extract_entities: true,
      extract_amounts: true,
      classify_document_type: true
    }
  });
}
```

## Best Practices

1. **API Key Management**
   - Always use environment variables for API keys
   - Never hardcode credentials in workflows
   - Use `.env` files for local development
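A minimal sketch of the environment-variable pattern above; the helper name `getApiKey` is illustrative, not part of the server's API:

```javascript
// Read the Unstructured API key from the environment (e.g. loaded from a
// .env file) instead of hardcoding it, and fail fast if it is missing.
function getApiKey(env = process.env) {
  const key = env.UNSTRUCTURED_API_KEY;
  if (!key) {
    throw new Error("UNSTRUCTURED_API_KEY is not set; add it to your .env file");
  }
  return key;
}
```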
2. **Workflow Design**
   - Design idempotent workflows for reliability
   - Use appropriate chunk sizes for large documents
   - Enable OCR only when necessary (it is resource-intensive)
   - Cache processed documents when possible

3. **Error Handling**
   - Monitor job status regularly
   - Implement retry logic for failed jobs
   - Log processing errors for debugging
   - Set appropriate timeouts for long-running jobs

4. **Performance Optimization**
   - Process documents in parallel when possible
   - Use appropriate file formats for your use case
   - Compress large files before processing
   - Clean up temporary files and jobs

5. **Security**
   - Encrypt sensitive documents in transit
   - Use secure connectors for data transfer
   - Implement access controls on storage
   - Audit document processing activities

## Integration with Claude Code

### Setting Up MCP Server

```bash
# Run directly via npx (no install step needed)
npx uns-mcp-server

# Or add to Claude Desktop config
claude mcp add uns-mcp npx uns-mcp-server
```

### Using in Claude Code Workflows

```javascript
// Initialize document processing swarm
mcp__claude-flow__swarm_init({
  topology: "hierarchical",
  maxAgents: 3
});

// Spawn document processor agent
Task("Document Processor",
  "Process legal contracts and extract key terms using Unstructured API",
  "document-processor"
);

// Coordinate with other agents
Task("Data Analyst",
  "Analyze extracted contract data for patterns",
  "analyst"
);

Task("Report Generator",
  "Create summary report of processed documents",
  "coder"
);
```

## Common Use Cases

1. **Contract Analysis**
   - Extract parties, dates, terms, and obligations
   - Compare contract versions
   - Identify risk clauses

2. **Invoice Processing**
   - Extract line items and totals
   - Validate against purchase orders
   - Automate data entry

3. **Resume Parsing**
   - Extract skills and experience
   - Standardize formats
   - Build searchable database

4. **Research Paper Processing**
   - Extract citations and references
   - Identify key findings
   - Build knowledge graphs
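For the research-paper case above, a downstream step often scans the extracted text for citation identifiers. This toy sketch (independent of the Unstructured API; the function name and regex are illustrative) pulls DOI strings out of plain text:

```javascript
// Toy post-processing sketch: find DOI strings of the common
// "10.<registrant>/<suffix>" form in extracted text, strip trailing
// punctuation, and de-duplicate the results.
function extractDois(text) {
  const pattern = /\b10\.\d{4,9}\/[^\s"<>]+/g;
  const matches = text.match(pattern) || [];
  return [...new Set(matches.map((d) => d.replace(/[.,;)]+$/, "")))];
}
```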
5. **Legal Discovery**
   - Process large document sets
   - Extract relevant information
   - Identify privileged content

## Troubleshooting

### Common Issues

1. **API Key Not Found**
   - Ensure `UNSTRUCTURED_API_KEY` is set
   - Check `.env` file location
   - Verify key validity

2. **Python Dependencies**
   - Ensure Python 3.8+ is installed
   - Run `pip install uns_mcp`
   - Check virtual environment activation

3. **Large File Processing**
   - Increase timeout settings
   - Use chunking for very large files
   - Monitor memory usage

4. **Network Issues**
   - Check firewall settings
   - Verify proxy configuration
   - Test API connectivity

## Resources

- [Unstructured.io Documentation](https://unstructured.io/docs)
- [MCP Protocol Specification](https://modelcontextprotocol.io)
- [Claude Code Documentation](https://claude.ai/docs)
- [Firecrawl API Reference](https://firecrawl.dev/docs)
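As a closing illustration of the error-handling advice in Best Practices (monitor job status, set timeouts), a polling helper might look like this. It is a sketch only: the `status` field and its `"completed"`/`"failed"` values are assumptions about the `get_job_info` response, not documented API:

```javascript
// Poll a job until it completes, fails, or times out. `getJobInfo` is the
// job-status function (passed in so the helper stays testable); the assumed
// response shape is { status: "running" | "completed" | "failed" }.
async function waitForJob(getJobInfo, jobId, { intervalMs = 5000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const job = await getJobInfo(jobId);
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(`Job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} timed out after ${timeoutMs} ms`);
}
```

Passing the status function in as a parameter keeps the helper decoupled from any particular MCP client binding.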