# Agent Specification: Acquisition Agent

## 1. Agent Overview

| Attribute | Value |
|-----------|-------|
| **Name** | Acquisition Agent |
| **ID** | research-acquisition-agent |
| **Purpose** | Download research papers, extract metadata, validate FAIR compliance, and assign persistent identifiers |
| **Lifecycle Stage** | Acquisition (Stage 2 of Research Framework) |
| **Model** | sonnet |
| **Version** | 1.0.0 |
| **Status** | Draft |

### Description

The Acquisition Agent transforms discovery results into a structured research corpus. It downloads PDFs from open access sources, extracts or retrieves metadata, assigns REF-XXX persistent identifiers, computes SHA-256 checksums for integrity verification, and validates FAIR (Findable, Accessible, Interoperable, Reusable) compliance. The agent handles paywalled papers through manual upload workflows and integrates with shared research-papers repositories to avoid duplicate downloads.

## 2. Capabilities

### Primary Capabilities

| Capability | Description | NFR Reference |
|------------|-------------|---------------|
| PDF Download | Download papers from Semantic Scholar, arXiv, publisher sites | NFR-RF-A-01 |
| Metadata Extraction | Extract title, authors, year, DOI from PDF or API | NFR-RF-A-02 |
| FAIR Validation | Score papers on Findable, Accessible, Interoperable, Reusable criteria | NFR-RF-A-09 |
| Checksum Computation | Generate SHA-256 hashes for integrity verification | NFR-RF-A-05 |
| REF-XXX Assignment | Assign sequential persistent identifiers | BR-RF-A-001 |
| Bulk Acquisition | Process multiple papers in parallel with rate limiting | NFR-RF-A-03 |

### Secondary Capabilities

| Capability | Description |
|------------|-------------|
| Manual Upload Handling | Accept user-provided PDFs for paywalled papers |
| Shared Corpus Deduplication | Create symlinks to existing papers in shared repository |
| Format Validation | Verify PDF format via magic bytes check |
| Progress Reporting | Real-time progress updates for bulk operations |

## 3. Tools

### Required Tools

| Tool | Purpose | Permission |
|------|---------|------------|
| Bash | Execute downloads, compute checksums | Execute |
| Read | Access acquisition queue, existing metadata | Read |
| Write | Save PDFs, metadata JSON, reports | Write |
| Glob | Find existing papers for deduplication | Read |
| Grep | Search metadata for duplicates | Read |

### External APIs

| API | Endpoint | Purpose | Auth |
|-----|----------|---------|------|
| Semantic Scholar | `api.semanticscholar.org` | Paper metadata, open access URLs | None |
| arXiv | `arxiv.org/pdf/` | Direct PDF download | None |
| CrossRef | `api.crossref.org` | DOI resolution | None |
| Unpaywall | `api.unpaywall.org` | Open access detection | Email header |

### System Tools

| Tool | Purpose |
|------|---------|
| `curl` / `wget` | HTTP downloads with resume support |
| `sha256sum` | Checksum computation |
| `file` | MIME type validation |
| `pdftotext` | PDF text extraction |

## 4. Triggers

### Automatic Triggers

| Trigger | Condition | Action |
|---------|-----------|--------|
| Discovery Complete | Acquisition queue populated (UC-RF-001) | Start bulk acquisition |
| Workflow Stage | UC-RF-008 initiates Stage 2 | Process workflow queue |

### Manual Triggers

| Trigger | Command | Description |
|---------|---------|-------------|
| Single Acquisition | `aiwg research acquire REF-XXX` | Acquire specific paper |
| Bulk from Queue | `aiwg research acquire --from-queue` | Process entire queue |
| Manual Upload | `aiwg research acquire --upload /path/to/file.pdf` | Add local PDF |
| Retry Failed | `aiwg research acquire --retry-failed` | Retry failed downloads |

### Event Triggers

| Event | Source | Action |
|-------|--------|--------|
| Paper Selected | Discovery Agent | Add to acquisition queue |
| Quality Gate Failed | Workflow Agent | Acquire additional sources |

## 5. Inputs/Outputs

### Inputs

| Input | Format | Source | Validation |
|-------|--------|--------|------------|
| Acquisition Queue | JSON | `.aiwg/research/discovery/acquisition-queue.json` | Valid paper IDs |
| Paper IDs | Array of strings | Command arguments | Valid IDs in queue |
| Manual PDF Path | File path | User input | File exists, valid PDF |
| Manual Metadata | JSON object | User input (if extraction fails) | Required fields present |

### Outputs

| Output | Format | Location | Retention |
|--------|--------|----------|-----------|
| PDF Files | Binary PDF | `.aiwg/research/sources/pdfs/{REF-XXX}-{slug}.pdf` | Permanent |
| Metadata JSON | JSON | `.aiwg/research/sources/metadata/{REF-XXX}-metadata.json` | Permanent |
| Acquisition Report | Markdown | `.aiwg/research/sources/acquisition-report-{timestamp}.md` | Permanent |
| Checksums | Text | `.aiwg/research/sources/checksums.txt` | Permanent |

### Output Schema: Metadata JSON

```json
{
  "ref_id": "REF-025",
  "title": "OAuth 2.0 Security Best Practices",
  "title_slug": "oauth-2-security-best-practices",
  "authors": [
    {"name": "Smith, John", "affiliation": "Stanford University"},
    {"name": "Doe, Jane", "affiliation": "MIT"}
  ],
  "year": 2023,
  "venue": "ACM Conference on Computer and Communications Security (CCS)",
  "venue_tier": "A*",
  "doi": "10.1145/3576915.3623456",
  "abstract": "This paper presents security best practices for OAuth 2.0...",
  "license": "CC-BY-4.0",
  "url": "https://www.semanticscholar.org/paper/abc123def456",
  "pdf_url": "https://arxiv.org/pdf/2301.12345.pdf",
  "citations": 42,
  "acquisition_timestamp": "2026-01-25T14:30:00Z",
  "acquisition_source": "semantic-scholar-api",
  "fair_score": {
    "findable": 90,
    "accessible": 100,
    "interoperable": 95,
    "reusable": 90,
    "overall": 94
  },
  "checksum_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "file_size_bytes": 2457600,
  "provenance": {
    "discovery_query": "OAuth2 security best practices",
    "discovery_timestamp": "2026-01-25T10:00:00Z",
    "selected_by": "user-manual-selection"
  }
}
```

## 6. Dependencies

### Agent Dependencies

| Agent | Relationship | Interaction |
|-------|--------------|-------------|
| Discovery Agent | Upstream | Receives acquisition queue |
| Documentation Agent | Downstream | Provides PDFs and metadata for UC-RF-003 |
| Workflow Agent | Orchestrator | Receives task assignments, reports completion |
| Provenance Agent | Observer | Logs all acquisition operations |

### Service Dependencies

| Service | Purpose | Fallback |
|---------|---------|----------|
| HTTP Downloads | Paper acquisition | Manual upload |
| Semantic Scholar API | Metadata retrieval | PDF extraction |
| File System | PDF and metadata storage | Abort if unavailable |

### Data Dependencies

| Data | Location | Required |
|------|----------|----------|
| Acquisition Queue | `.aiwg/research/discovery/acquisition-queue.json` | Yes |
| REF Counter | `.aiwg/research/sources/ref-counter.txt` | Yes (created if missing) |
| Shared Corpus | `/tmp/research-papers/sources/` | Optional |

## 7. Configuration Options

### Agent Configuration

```yaml
# .aiwg/research/config/research-acquisition-agent.yaml
acquisition_agent:
  # Download Configuration
  download:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_ms: [5000, 10000, 20000]
    concurrent_downloads: 5
    user_agent: "AIWG Research Framework/1.0 (research tool)"

  # Storage Configuration
  storage:
    pdf_directory: ".aiwg/research/sources/pdfs"
    metadata_directory: ".aiwg/research/sources/metadata"
    max_file_size_mb: 100
    permissions: "644"

  # FAIR Scoring Weights
  fair_scoring:
    findable:
      doi_present: 40
      metadata_complete: 10  # per field: title, authors, year, venue, abstract
    accessible:
      persistent_url: 50
      clear_license: 50
    interoperable:
      json_format: 50
      schema_compliance: 50
    reusable:
      license_permits_reuse: 50
      provenance_documented: 50

  # Shared Corpus Integration
  shared_corpus:
    enabled: true
    path: "/tmp/research-papers/sources"
    symlink_enabled: true

  # REF-XXX Format
  ref_format:
    prefix: "REF"
    digits: 3  # REF-001, REF-002, etc.
```

### Environment Variables

| Variable | Purpose | Default |
|----------|---------|---------|
| `AIWG_RESEARCH_DOWNLOAD_TIMEOUT` | Download timeout in seconds | 60 |
| `AIWG_RESEARCH_CONCURRENT_DOWNLOADS` | Max parallel downloads | 5 |
| `AIWG_RESEARCH_SHARED_CORPUS` | Path to shared paper repository | None |

## 8. Error Handling

### Error Categories

| Error Type | Severity | Handling Strategy |
|------------|----------|-------------------|
| Download Timeout | Warning | Retry 3x with backoff |
| 404 Not Found | Warning | Log, skip, continue with next |
| Invalid PDF | Warning | Flag for manual upload |
| Disk Full | Error | Abort, cleanup, notify user |
| Metadata Extraction Failed | Warning | Prompt for manual input |
| Checksum Mismatch | Error | Re-download and verify |

### Error Response Template

```json
{
  "error_code": "ACQUISITION_DOWNLOAD_FAILED",
  "severity": "warning",
  "paper_id": "abc123def456",
  "message": "Failed to download PDF: Network timeout after 60 seconds",
  "retry_count": 3,
  "remediation": "Provide manual PDF upload or skip this paper",
  "next_action": "Continue with remaining papers"
}
```

### Recovery Procedures

| Scenario | Procedure |
|----------|-----------|
| Partial acquisition failure | Save successful downloads, log failures, allow retry |
| Corrupted download | Delete partial file, retry from scratch |
| Metadata extraction failed | Use API metadata, prompt user if unavailable |
| Shared corpus unavailable | Fall back to local-only storage |

## 9. Metrics/Observability

### Performance Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Download time per paper | <60 seconds median | Timer from request to save |
| Metadata extraction time | <10 seconds | Timer for extraction |
| Bulk throughput | 5 concurrent downloads | Active downloads |
| Success rate | >90% | Successful / total attempted |

### Logging

| Log Level | Events |
|-----------|--------|
| INFO | Acquisition start, paper acquired, completion |
| DEBUG | Download progress, metadata extraction steps |
| WARNING | Retry triggered, FAIR score low, paywalled paper |
| ERROR | Download failed, disk error, validation failure |

### Telemetry

```json
{
  "event": "acquisition_complete",
  "timestamp": "2026-01-25T14:30:00Z",
  "metrics": {
    "papers_attempted": 25,
    "papers_acquired": 23,
    "papers_failed": 2,
    "total_size_mb": 115.5,
    "average_download_time_ms": 8500,
    "fair_score_average": 87
  }
}
```

## 10. Example Usage

### Single Paper Acquisition

```bash
# Acquire a specific paper from queue
aiwg research acquire REF-025

# Output:
# Acquiring REF-025: "OAuth 2.0 Security Best Practices"
# Downloading from: https://arxiv.org/pdf/2301.12345.pdf
# Download complete: 2.4 MB
# Validating PDF format... OK
# Extracting metadata... OK
# Computing SHA-256 checksum... OK
# FAIR validation: 94/100 (High)
# Saved: .aiwg/research/sources/pdfs/REF-025-oauth-2-security-best-practices.pdf
# Metadata: .aiwg/research/sources/metadata/REF-025-metadata.json
```

### Bulk Acquisition from Queue

```bash
# Acquire all papers in queue
aiwg research acquire --from-queue

# Output:
# Acquisition queue: 25 papers
# Processing 5 concurrent downloads...
# [1/25] REF-001: Downloading... OK (1.2 MB)
# [2/25] REF-002: Downloading... OK (3.4 MB)
# [3/25] REF-003: PAYWALLED - Manual upload required
# ...
# [25/25] REF-025: Downloading... OK (2.4 MB)
#
# Acquisition Summary:
# - Acquired: 23/25 (92%)
# - Paywalled: 2 (manual upload required)
# - Total size: 115.5 MB
# - Average FAIR score: 87/100
#
# Report saved: .aiwg/research/sources/acquisition-report-2026-01-25T14-30-00.md
```

### Manual PDF Upload

```bash
# Upload paywalled paper manually
aiwg research acquire --upload /tmp/oauth-paper.pdf --ref REF-003

# Output:
# Validating PDF format... OK
# Extracting metadata from PDF...
# - Title: "OAuth 2.0 Authorization Framework"
# - Authors: [auto-extracted]
# - Year: 2023
# Confirm metadata? (y/n/edit): y
# Assigning identifier: REF-003
# Computing checksum... OK
# FAIR validation: 72/100 (Moderate - missing license info)
# Saved: .aiwg/research/sources/pdfs/REF-003-oauth-2-authorization-framework.pdf
```

### Shared Corpus Deduplication

```bash
# Paper already in shared corpus
aiwg research acquire REF-042

# Output:
# Checking shared corpus at /tmp/research-papers/sources/...
# Match found: Paper already acquired (DOI: 10.1145/example)
# Create symlink to shared corpus? (y/n): y
# Symlink created: .aiwg/research/sources/pdfs/REF-042.pdf -> /tmp/research-papers/sources/abc123.pdf
# Reusing existing metadata (no re-download)
```

## 11. Related Use Cases

| Use Case | Relationship | Description |
|----------|--------------|-------------|
| UC-RF-002 | Primary | Acquire Research Source with FAIR Validation |
| UC-RF-001 | Upstream | Discover Research Papers (provides queue) |
| UC-RF-003 | Downstream | Document Research Paper (receives PDFs) |
| UC-RF-008 | Orchestrated | Execute Research Workflow (Stage 2) |

## 12. Implementation Notes

### Architecture Considerations

1. **Parallel Download Management**: Use worker pool for concurrent downloads
2. **Transactional Acquisition**: All-or-nothing per paper (no partial saves)
3. **Idempotent Operations**: Re-acquiring same paper updates metadata, doesn't duplicate
4. **Storage Efficiency**: Symlinks for shared corpus, deduplication by DOI

### Performance Optimizations

1. **Streaming Downloads**: Stream large files to avoid memory issues
2. **Parallel Checksums**: Compute checksum while downloading (stream hash)
3. **Batch Metadata Retrieval**: Query API for multiple papers in one request
4. **Resume Support**: Resume interrupted downloads when supported

### Security Considerations

1. **URL Validation**: Only download from whitelisted domains
2. **File Type Verification**: Magic bytes check, not just extension
3. **Checksum Verification**: Detect corrupted or tampered files
4. **Copyright Compliance**: Respect publisher terms, prioritize open access

### Testing Strategy

| Test Type | Coverage Target | Focus Areas |
|-----------|-----------------|-------------|
| Unit Tests | 80% | Metadata extraction, FAIR scoring, REF assignment |
| Integration Tests | 70% | Download handling, file I/O, API interaction |
| E2E Tests | Key workflows | Full acquisition from queue to storage |

### Known Limitations

1. **Paywalled Papers**: Cannot auto-download; require manual upload
2. **Rate Limits**: Some publishers block rapid downloads
3. **PDF Quality**: Scanned PDFs may have poor metadata extraction
4. **Large Files**: Papers >100MB may timeout on slow connections

---

## References

- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-002-acquire-research-source.md
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/vision-document.md - Section 7.1 (Acquisition Management)
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/initial-risk-assessment.md - T-04 (Copyright Compliance)
- [FAIR Principles](https://www.go-fair.org/fair-principles/)

---

## Document Metadata

**Version:** 1.0 (Draft)
**Status:** DRAFT - Awaiting Review
**Created:** 2026-01-25
**Last Updated:** 2026-01-25
**Owner:** Agent Designer (Research Framework Team)
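
The implementation notes ask for a magic bytes check ("File Type Verification") and for the checksum to be computed while the file streams ("Parallel Checksums"). The sketch below illustrates both in Python under stated assumptions: the function names `is_pdf` and `sha256_stream` are hypothetical and not part of aiwg, and the check only looks for the standard `%PDF-` header at the start of the file.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

# Assumption: a valid download starts with the standard PDF header bytes.
PDF_MAGIC = b"%PDF-"


def is_pdf(header: bytes) -> bool:
    """Return True if the leading bytes carry the PDF magic number."""
    return header.startswith(PDF_MAGIC)


def sha256_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Tuple[str, int]:
    """Hash a stream chunk by chunk and return (hex digest, bytes read).

    Reading in chunks keeps memory use flat for large files, which is the
    point of the "stream hash" note in the performance section.
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


if __name__ == "__main__":
    sample = io.BytesIO(b"%PDF-1.7\n% hypothetical sample bytes\n")
    print(is_pdf(sample.read(5)))
    sample.seek(0)
    print(sha256_stream(sample))
```

In a real download loop the same `digest.update(chunk)` call would sit next to the write to disk, so the file is read only once.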
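
The `fair_scoring` weights in Section 7 give 40 points for a DOI plus 10 points for each of five metadata fields. A minimal sketch of how those weights could be applied follows. Two things here are assumptions rather than facts from this specification: the helper names, and the rule that the overall score is the rounded mean of the four dimensions. That rule is only inferred from the sample metadata, where 90, 100, 95, and 90 yield an overall score of 94.

```python
from typing import Any, Dict, Mapping

# Fields named in the config comment for metadata_complete.
METADATA_FIELDS = ("title", "authors", "year", "venue", "abstract")

DOI_PRESENT_POINTS = 40        # fair_scoring.findable.doi_present
METADATA_FIELD_POINTS = 10     # fair_scoring.findable.metadata_complete (per field)


def findable_score(metadata: Mapping[str, Any]) -> int:
    """Score the Findable dimension from the Section 7 weights."""
    score = DOI_PRESENT_POINTS if metadata.get("doi") else 0
    score += METADATA_FIELD_POINTS * sum(
        1 for field in METADATA_FIELDS if metadata.get(field)
    )
    return score


def overall_score(dimensions: Mapping[str, int]) -> int:
    """Assumed aggregation: rounded arithmetic mean of the dimension scores."""
    return round(sum(dimensions.values()) / len(dimensions))


if __name__ == "__main__":
    paper: Dict[str, Any] = {
        "doi": "10.1145/3576915.3623456",
        "title": "OAuth 2.0 Security Best Practices",
        "authors": [{"name": "Smith, John"}],
        "year": 2023,
        "venue": "ACM Conference on Computer and Communications Security (CCS)",
        "abstract": "This paper presents security best practices for OAuth 2.0...",
    }
    print(findable_score(paper))
    print(overall_score(
        {"findable": 90, "accessible": 100, "interoperable": 95, "reusable": 90}
    ))
```

With these weights the Findable dimension tops out at 90, which matches the `findable` value in the sample metadata.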
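
Sequential REF-XXX assignment relies on the `ref_format` settings (prefix `REF`, 3 digits) and on the REF counter file, which is created if missing. The sketch below shows one way this could work. The counter file format is not defined in this specification, so the code assumes it stores the last assigned number as plain text; `format_ref_id` and `next_ref_id` are hypothetical names.

```python
import tempfile
from pathlib import Path


def format_ref_id(number: int, prefix: str = "REF", digits: int = 3) -> str:
    """Render a zero-padded identifier such as REF-001."""
    return f"{prefix}-{number:0{digits}d}"


def next_ref_id(counter_file: Path) -> str:
    """Read the counter, increment it, persist it, and return the new ID.

    Assumption: the file holds the last assigned integer. A missing file
    is treated as "nothing assigned yet" and is created on first use.
    """
    last = int(counter_file.read_text().strip()) if counter_file.exists() else 0
    current = last + 1
    counter_file.parent.mkdir(parents=True, exist_ok=True)
    counter_file.write_text(f"{current}\n")
    return format_ref_id(current)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        counter = Path(tmp) / "sources" / "ref-counter.txt"
        print(next_ref_id(counter))
        print(next_ref_id(counter))
```

This read-modify-write sequence is not safe when several workers run at once, so with concurrent downloads the identifier would need to be assigned in a single coordinating step or under a file lock.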
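
The download settings pair `retry_attempts: 3` with `retry_backoff_ms: [5000, 10000, 20000]`, and the error table says a timeout is retried 3x with backoff. One possible reading, sketched below, is an initial attempt followed by up to three retries, each preceded by the next delay in the list. The function name, the injectable `sleep` parameter, and the choice of `TimeoutError` as the retryable error are illustrative assumptions.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    backoff_ms: Sequence[int] = (5000, 10000, 20000),
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> T:
    """Run operation, retrying once per backoff entry on a retryable error.

    With three backoff values this makes at most four attempts: the first
    try plus three retries. The final failure propagates to the caller.
    """
    for delay in backoff_ms:
        try:
            return operation()
        except retry_on:
            sleep(delay / 1000)  # config values are in milliseconds
    return operation()


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_download() -> str:
        # Hypothetical stand-in for a download that times out twice.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TimeoutError("simulated timeout")
        return "downloaded"

    # Pass a no-op sleep so the demo finishes immediately.
    print(retry_with_backoff(flaky_download, sleep=lambda seconds: None))
```

Errors such as a 404 are not retried here because they are not listed in `retry_on`, which mirrors the "log, skip, continue" handling in the error table.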