# API Integration Specifications
**Document Type**: Technical Specification
**Phase**: Elaboration
**Status**: Draft
**Version**: 1.0.0
**Last Updated**: 2026-01-25
## 1. API Overview
### 1.1 Purpose
Define integration patterns, data contracts, and operational requirements for external APIs used in the AIWG Research Framework.
### 1.2 Scope
This specification covers:
- **Research APIs**: Semantic Scholar, CrossRef, arXiv, Unpaywall
- **LLM APIs**: Claude (Anthropic), OpenAI
- **Optional Integrations**: Zotero, Neo4j, Obsidian
### 1.3 Integration Approach
**Architecture Pattern**: Service adapter layer with unified interface
```
             ┌─────────────────┐
             │ CLI/MCP Server  │
             └────────┬────────┘
                      │
             ┌────────▼───────────────┐
             │ Research Service Layer │
             └────────┬───────────────┘
                      │
    ┌────────┬────────┼──────────┬─────────┐
    │        │        │          │         │
┌───▼───┐ ┌──▼──┐ ┌───▼───┐ ┌────▼────┐ ┌──▼───┐
│ S2 API│ │ CREF│ │ arXiv │ │Unpaywall│ │ LLM  │
└───────┘ └─────┘ └───────┘ └─────────┘ └──────┘
```
**Key Principles**:
- Adapter pattern for each external API
- Unified error handling across all integrations
- Comprehensive caching to minimize API calls
- Rate limit awareness and compliance
- Graceful degradation when services are unavailable
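The adapter principle can be illustrated with a minimal sketch. The names `PaperRecord`, `ResearchAdapter`, and `ResearchService` are hypothetical and not part of this specification; the real record shape belongs to the data model document.

```typescript
// Hypothetical normalized record, used only for this sketch.
interface PaperRecord {
  title: string;
  doi?: string;
  year?: number;
  source: string; // name of the adapter that produced the record
}

// One adapter per external API, all exposing the same method.
interface ResearchAdapter {
  readonly name: string;
  findByDoi(doi: string): Promise<PaperRecord | null>;
}

// The service layer depends only on the interface, so adapters can be
// added, removed, or mocked without touching callers.
class ResearchService {
  private readonly adapters: ResearchAdapter[];

  constructor(adapters: ResearchAdapter[]) {
    this.adapters = adapters;
  }

  async findByDoi(doi: string): Promise<PaperRecord | null> {
    for (const adapter of this.adapters) {
      try {
        const record = await adapter.findByDoi(doi);
        if (record) return record;
      } catch {
        // Graceful degradation: a failing adapter does not abort the lookup.
      }
    }
    return null;
  }
}
```

Adapters are tried in the order given, so the order doubles as a priority list.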
## 2. Semantic Scholar API
### 2.1 Overview
**Base URL**: `https://api.semanticscholar.org/graph/v1`
**Documentation**: https://api.semanticscholar.org/api-docs/
**Authentication**: Optional API key (recommended for production)
### 2.2 Endpoints
#### 2.2.1 Paper Search
**Endpoint**: `GET /paper/search`
**Query Parameters**:
```typescript
interface SearchParams {
query: string; // Search query
limit?: number; // Max results (default: 10, max: 100)
offset?: number; // Pagination offset
fields?: string; // Comma-separated field list
publicationTypes?: string; // Filter by type (JournalArticle, Conference, etc.)
year?: string; // Filter by year range (e.g., "2020-2023")
minCitationCount?: number; // Minimum citations
venue?: string; // Filter by venue/journal
openAccessPdf?: boolean; // Only papers with free PDFs
}
```
**Response Schema**:
```typescript
interface SearchResponse {
total: number;
offset: number;
next?: number;
data: Paper[];
}
interface Paper {
paperId: string;
externalIds?: {
DOI?: string;
ArXiv?: string;
MAG?: string;
CorpusId?: number;
};
title: string;
abstract?: string;
venue?: string;
year?: number;
citationCount?: number;
influentialCitationCount?: number;
isOpenAccess?: boolean;
openAccessPdf?: {
url: string;
status: string;
};
authors?: Author[];
publicationTypes?: string[];
publicationDate?: string;
journal?: {
name?: string;
pages?: string;
volume?: string;
};
fieldsOfStudy?: string[];
}
interface Author {
authorId: string;
name: string;
}
```
**Example Request**:
```bash
curl -H "x-api-key: ${S2_API_KEY}" \
"https://api.semanticscholar.org/graph/v1/paper/search?query=chain+of+thought+prompting&fields=title,abstract,authors,year,citationCount,openAccessPdf&limit=20"
```
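The same request can be assembled programmatically. The sketch below is one possible approach, not the framework's client: `buildSearchUrl` and `S2_BASE_URL` are illustrative names, and the interface repeats only a subset of `SearchParams` so the block is self-contained.

```typescript
// Subset of SearchParams from the schema above.
interface SearchParams {
  query: string;
  limit?: number;
  offset?: number;
  fields?: string;
  year?: string;
  minCitationCount?: number;
}

const S2_BASE_URL = "https://api.semanticscholar.org/graph/v1";

function buildSearchUrl(params: SearchParams): string {
  const search = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    // Skip unset optional parameters instead of sending "undefined".
    if (value !== undefined) search.set(key, String(value));
  }
  return `${S2_BASE_URL}/paper/search?${search.toString()}`;
}

const url = buildSearchUrl({
  query: "chain of thought prompting",
  fields: "title,year,citationCount",
  limit: 20,
});
console.log(url);
```

`URLSearchParams` handles the encoding, so callers pass the raw query text rather than a pre-encoded string.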
#### 2.2.2 Paper Details
**Endpoint**: `GET /paper/{paper_id}`
**Path Parameters**:
- `paper_id`: Semantic Scholar ID, or a prefixed external ID such as `DOI:10.1234/example` or `ARXIV:2301.12345`
**Query Parameters**:
```typescript
interface DetailsParams {
fields?: string; // Comma-separated field list
}
```
**Response Schema**: Same as `Paper` interface above
**Example Request**:
```bash
# By DOI
curl -H "x-api-key: ${S2_API_KEY}" \
"https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,abstract,citations,references"
# By ArXiv ID
curl -H "x-api-key: ${S2_API_KEY}" \
"https://api.semanticscholar.org/graph/v1/paper/ARXIV:2301.12345?fields=title,abstract"
```
#### 2.2.3 Citations
**Endpoint**: `GET /paper/{paper_id}/citations`
**Query Parameters**:
```typescript
interface CitationsParams {
fields?: string;
limit?: number; // Max: 1000
offset?: number;
}
```
**Response Schema**:
```typescript
interface CitationsResponse {
offset: number;
next?: number;
data: {
contexts: string[]; // Citation context snippets
intents: string[]; // Citation intent (background, methodology, result)
isInfluential: boolean;
citingPaper: Paper;
}[];
}
```
#### 2.2.4 Recommendations
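**Endpoint**: `GET https://api.semanticscholar.org/recommendations/v1/papers/forpaper/{paper_id}` (served by the separate Recommendations API, not under the Graph API base URL)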
**Query Parameters**:
```typescript
interface RecommendationsParams {
fields?: string;
limit?: number; // Max: 500
}
```
**Response Schema**:
```typescript
interface RecommendationsResponse {
recommendedPapers: Paper[];
}
```
### 2.3 Rate Limits
| Tier | Rate Limit | Authentication |
|------|------------|----------------|
| Public | 100 requests/second | None |
| Authenticated | 1000 requests/second | API key required |
**Rate Limit Headers**:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1234567890
```
**Implementation Strategy**:
- Token bucket algorithm with per-second refill
- Exponential backoff on 429 responses
- Queue requests when approaching limit
### 2.4 Error Handling
**HTTP Status Codes**:
| Code | Meaning | Action |
|------|---------|--------|
| 200 | Success | Process response |
| 400 | Bad request | Validate query parameters |
| 404 | Paper not found | Try alternative IDs (DOI → ArXiv) |
| 429 | Rate limit exceeded | Exponential backoff, retry |
| 500 | Server error | Retry with exponential backoff (max 3 attempts) |
| 503 | Service unavailable | Retry after 5s, 10s, 30s |
**Error Response Schema**:
```typescript
interface ErrorResponse {
error: string;
message: string;
}
```
**Retry Policy**:
```typescript
const retryConfig = {
maxAttempts: 3,
backoffMultiplier: 2,
initialDelay: 1000, // 1s
maxDelay: 30000, // 30s
retryableStatuses: [429, 500, 502, 503, 504]
};
```
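To make the policy concrete, the wait schedule it implies can be computed directly. This is a small sketch under the assumption that the first attempt runs immediately and each retry waits `initialDelay * backoffMultiplier^n`, capped at `maxDelay`; `backoffDelays` is an illustrative name.

```typescript
interface RetryConfig {
  maxAttempts: number;
  backoffMultiplier: number;
  initialDelay: number; // ms
  maxDelay: number; // ms
}

// Delay before each retry. The first attempt is made immediately,
// so maxAttempts attempts produce maxAttempts - 1 waits.
function backoffDelays(config: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < config.maxAttempts - 1; retry++) {
    const delay = config.initialDelay * config.backoffMultiplier ** retry;
    delays.push(Math.min(delay, config.maxDelay));
  }
  return delays;
}

console.log(
  backoffDelays({ maxAttempts: 3, backoffMultiplier: 2, initialDelay: 1000, maxDelay: 30000 })
); // [ 1000, 2000 ]
```

With the values above, a request that keeps failing is abandoned after roughly 3 seconds of waiting plus request time.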
## 3. CrossRef API
### 3.1 Overview
**Base URL**: `https://api.crossref.org`
**Documentation**: https://api.crossref.org/swagger-ui/
**Authentication**: None required (polite pool encouraged)
### 3.2 Endpoints
#### 3.2.1 Metadata Lookup by DOI
**Endpoint**: `GET /works/{doi}`
**Headers**:
```
User-Agent: AIWG-Research/1.0 (mailto:your-email@example.com)
```
**Response Schema**:
```typescript
interface CrossRefWork {
status: "ok";
message: {
DOI: string;
title: string[];
author?: {
given?: string;
family: string;
ORCID?: string;
}[];
"published-print"?: { "date-parts": number[][] };
"published-online"?: { "date-parts": number[][] };
"container-title"?: string[];
volume?: string;
issue?: string;
page?: string;
publisher?: string;
type?: string; // journal-article, proceedings-article, etc.
abstract?: string;
"is-referenced-by-count"?: number; // Citation count
reference?: {
key: string;
DOI?: string;
"article-title"?: string;
}[];
};
}
```
**Example Request**:
```bash
curl -H "User-Agent: AIWG-Research/1.0 (mailto:your@email.com)" \
"https://api.crossref.org/works/10.1145/3491102.3517582"
```
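CrossRef encodes dates as nested `date-parts` arrays and splits author names, so a little normalization is usually needed. The helpers below are a sketch; `publicationYear` and `authorNames` are illustrative names, and preferring the print date over the online date is an assumption rather than a rule from this specification.

```typescript
type DateParts = { "date-parts": number[][] };

// Subset of CrossRefWork["message"] from the schema above.
interface CrossRefMessage {
  title: string[];
  author?: { given?: string; family: string }[];
  "published-print"?: DateParts;
  "published-online"?: DateParts;
}

// Assumption: prefer the print date and fall back to the online date.
function publicationYear(msg: CrossRefMessage): number | undefined {
  const parts = (msg["published-print"] ?? msg["published-online"])?.["date-parts"];
  return parts?.[0]?.[0];
}

// "given family" when both are present, otherwise the family name alone.
function authorNames(msg: CrossRefMessage): string[] {
  return (msg.author ?? []).map((a) => (a.given ? `${a.given} ${a.family}` : a.family));
}
```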
#### 3.2.2 Work Search
**Endpoint**: `GET /works`
**Query Parameters**:
```typescript
interface WorkSearchParams {
"query.title"?: string;
"query.author"?: string;
"query.bibliographic"?: string; // General search
filter?: string; // e.g., "from-pub-date:2020,until-pub-date:2023"
rows?: number; // Results per page (max: 1000)
offset?: number;
sort?: string; // relevance, published, is-referenced-by-count
order?: "asc" | "desc";
}
```
**Response Schema**:
```typescript
interface WorkSearchResponse {
status: "ok";
"message-type": "work-list";
message: {
"total-results": number;
items: CrossRefWork["message"][];
};
}
```
### 3.3 Rate Limits
| Pool | Rate Limit | Requirements |
|------|------------|--------------|
| Standard | ~50 req/sec | None |
| Polite | Higher priority | User-Agent with email |
| Plus | No limit | Subscription |
**Polite Pool Requirements**:
- Include `User-Agent` header with email: `AIWG-Research/1.0 (mailto:your@email.com)`
- Better response times and priority
### 3.4 Error Handling
**HTTP Status Codes**:
| Code | Meaning | Action |
|------|---------|--------|
| 200 | Success | Process response |
| 404 | DOI not found | Log and skip |
| 429 | Rate limited | Exponential backoff |
| 503 | Temporarily unavailable | Retry after delay |
**Fallback Strategy**:
- If CrossRef fails, attempt Semantic Scholar lookup by DOI
- Cache successful responses aggressively (TTL: 30 days, matching the cache TTL table in section 7.2)
## 4. arXiv API
### 4.1 Overview
**Base URL**: `http://export.arxiv.org/api/query`
**Documentation**: https://arxiv.org/help/api/
**Authentication**: None required
### 4.2 Endpoints
#### 4.2.1 Paper Search
**Endpoint**: `GET /api/query` (the base URL above already includes this path)
**Query Parameters**:
```typescript
interface ArXivSearchParams {
search_query: string; // e.g., "all:transformer" or "ti:attention"
start?: number; // Pagination start index
max_results?: number; // Up to 30000 results per query, returned in slices of at most 2000 per request
sortBy?: "relevance" | "lastUpdatedDate" | "submittedDate";
sortOrder?: "ascending" | "descending";
}
```
**Search Query Syntax**:
```
ti: Title
au: Author
abs: Abstract
cat: Category (e.g., cs.AI, cs.CL)
all: All fields
Examples:
- "ti:neural+architecture+search"
- "au:lecun+AND+cat:cs.CV"
- "abs:reinforcement+learning+AND+submittedDate:[202001010000+TO+202312312359]"
```
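A query string in this syntax can be built from structured clauses. The sketch below covers only `AND` combinations of the field prefixes listed above; `buildArxivQuery` is an illustrative name, and date-range filters are left out.

```typescript
type ArxivField = "ti" | "au" | "abs" | "cat" | "all";

interface ArxivClause {
  field: ArxivField;
  term: string;
}

// Joins clauses with AND; spaces inside a term become "+", as in the examples above.
function buildArxivQuery(clauses: ArxivClause[]): string {
  return clauses
    .map(({ field, term }) => `${field}:${term.trim().split(/\s+/).join("+")}`)
    .join("+AND+");
}

console.log(buildArxivQuery([
  { field: "au", term: "lecun" },
  { field: "cat", term: "cs.CV" },
])); // au:lecun+AND+cat:cs.CV
```

The result is already in URL form, so it should be appended to `search_query=` directly; passing it through `URLSearchParams` would encode each `+` as `%2B`.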
**Response Schema** (Atom XML):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<title>ArXiv Query: search_query=all:transformer</title>
<opensearch:totalResults>15234</opensearch:totalResults>
<opensearch:startIndex>0</opensearch:startIndex>
<opensearch:itemsPerPage>10</opensearch:itemsPerPage>
<entry>
<id>http://arxiv.org/abs/1706.03762v7</id>
<title>Attention Is All You Need</title>
<summary>Abstract text...</summary>
<author>
<name>Ashish Vaswani</name>
</author>
<published>2017-06-12T17:57:34Z</published>
<updated>2023-08-02T11:01:03Z</updated>
<link href="http://arxiv.org/abs/1706.03762v7" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/1706.03762v7" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.CL"/>
<category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>
</feed>
```
**TypeScript Interface** (after parsing):
```typescript
interface ArXivPaper {
id: string; // arXiv ID (e.g., "1706.03762v7")
title: string;
summary: string; // Abstract
authors: string[];
published: string; // ISO 8601
updated: string;
pdfUrl: string;
categories: string[]; // e.g., ["cs.CL", "cs.LG"]
primaryCategory: string;
}
interface ArXivSearchResponse {
totalResults: number;
startIndex: number;
itemsPerPage: number;
entries: ArXivPaper[];
}
```
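The `<id>` element in the feed holds a URL, while `ArXivPaper.id` holds the bare identifier, so the parser needs a small conversion step. A possible sketch follows; `parseArxivEntryId` is a hypothetical helper, and splitting off the version is an optional convenience for building versionless cache keys.

```typescript
interface ArxivIdParts {
  id: string; // versionless ID, e.g. "1706.03762" or "hep-th/9901001"
  version?: number; // e.g. 7 for "1706.03762v7"
}

function parseArxivEntryId(entryUrl: string): ArxivIdParts | null {
  // Capture everything after "/abs/", with an optional trailing "vN" version suffix.
  const match = /arxiv\.org\/abs\/(.+?)(?:v(\d+))?$/.exec(entryUrl);
  if (!match) return null;
  return match[2] ? { id: match[1], version: Number(match[2]) } : { id: match[1] };
}

console.log(parseArxivEntryId("http://arxiv.org/abs/1706.03762v7"));
// { id: '1706.03762', version: 7 }
```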
**Example Request**:
```bash
curl "http://export.arxiv.org/api/query?search_query=all:chain+of+thought&start=0&max_results=10&sortBy=submittedDate&sortOrder=descending"
```
#### 4.2.2 Paper Download
**PDF URL Pattern**: `http://arxiv.org/pdf/{arxiv_id}.pdf`
**Example**:
```bash
curl -o paper.pdf "http://arxiv.org/pdf/1706.03762.pdf"
```
### 4.3 Rate Limits
**Official Limits**:
- Max 1 request per 3 seconds
- Bulk downloads discouraged (use S3 instead)
**Implementation**:
```typescript
const ARXIV_DELAY_MS = 3000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class ArXivClient {
  // Earliest time (ms since epoch) at which the next request may start.
  private nextSlot = 0;

  async search(params: ArXivSearchParams): Promise<ArXivSearchResponse> {
    await this.respectRateLimit();
    // ... make request
  }

  private async respectRateLimit(): Promise<void> {
    // Reserve a slot synchronously so concurrent callers queue up
    // instead of all reading the same timestamp and firing together.
    const slot = Math.max(Date.now(), this.nextSlot);
    this.nextSlot = slot + ARXIV_DELAY_MS;
    await sleep(Math.max(0, slot - Date.now()));
  }
}
### 4.4 Error Handling
**HTTP Status Codes**:
| Code | Meaning | Action |
|------|---------|--------|
| 200 | Success | Parse Atom XML |
| 400 | Malformed query | Validate and fix query syntax |
| 503 | Service unavailable | Retry after 30s |
**XML Parsing Errors**:
- Use robust XML parser (e.g., `fast-xml-parser`)
- Handle malformed feeds gracefully
- Log parsing errors with original XML
## 5. Unpaywall API
### 5.1 Overview
**Base URL**: `https://api.unpaywall.org/v2`
**Documentation**: https://unpaywall.org/products/api
**Authentication**: Email required (append `?email=your@email.com`)
### 5.2 Endpoints
#### 5.2.1 DOI Lookup
**Endpoint**: `GET /{doi}` (relative to the base URL, which already includes `/v2`)
**Query Parameters**:
```typescript
interface UnpaywallParams {
email: string; // Required
}
```
**Response Schema**:
```typescript
interface UnpaywallResponse {
doi: string;
doi_url: string;
title: string;
is_oa: boolean; // Is open access?
oa_status: "gold" | "green" | "hybrid" | "bronze" | "closed";
best_oa_location?: {
url: string;
url_for_pdf?: string;
url_for_landing_page: string;
version: "publishedVersion" | "acceptedVersion" | "submittedVersion";
license?: string;
host_type: "publisher" | "repository";
};
oa_locations: {
url: string;
url_for_pdf?: string;
version: string;
license?: string;
host_type: string;
}[];
published_date?: string;
year?: number;
journal_name?: string;
journal_issns?: string;
publisher?: string;
z_authors?: { given?: string; family?: string; }[]; // Crossref-style author objects
}
```
**Example Request**:
```bash
curl "https://api.unpaywall.org/v2/10.1038/nature12373?email=your@email.com"
```
### 5.3 Rate Limits
**Official Limits**: 100,000 requests per day
**Implementation**:
- No per-second limit specified
- Reasonable delay (500ms) between requests
- Cache responses aggressively (TTL: 30 days - OA status rarely changes)
### 5.4 Integration with Acquisition
**Workflow**:
1. A paper is discovered via Semantic Scholar or arXiv.
2. If a DOI is available and no PDF has been obtained yet, query Unpaywall.
3. If `is_oa` is `true` and `best_oa_location.url_for_pdf` exists:
- Download PDF from `url_for_pdf`
- Store in `.aiwg/research/pdfs/{paper_id}.pdf`
- Record license info in metadata
**Error Handling**:
| Scenario | Action |
|----------|--------|
| DOI not found (404) | Skip, not all papers indexed |
| No OA version | Mark as "access restricted" in metadata |
| PDF download fails | Retry once, then mark as "download failed" |
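The decision at the heart of this workflow can be sketched as a pure function over the Unpaywall response. The names `decideAcquisition` and `AcquisitionDecision` are hypothetical; the interfaces repeat only the fields the decision needs, and downloading, storage, and retry are left to the caller.

```typescript
interface OaLocation {
  url_for_pdf?: string;
  license?: string;
}

interface UnpaywallResult {
  is_oa: boolean;
  best_oa_location?: OaLocation;
}

type AcquisitionDecision =
  | { action: "download"; pdfUrl: string; license?: string }
  | { action: "restricted" };

function decideAcquisition(result: UnpaywallResult): AcquisitionDecision {
  const pdfUrl = result.best_oa_location?.url_for_pdf;
  if (result.is_oa && pdfUrl) {
    return { action: "download", pdfUrl, license: result.best_oa_location?.license };
  }
  // Covers both "no OA version" and "OA landing page without a direct PDF".
  return { action: "restricted" };
}
```

Keeping the decision free of I/O makes it easy to unit test against recorded fixtures.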
## 6. LLM Integration
### 6.1 Overview
**Purpose**: Summarization, synthesis, concept extraction from research papers
**Supported Providers**:
- Anthropic Claude (primary)
- OpenAI GPT-4 (fallback)
### 6.2 Claude API
**Base URL**: `https://api.anthropic.com/v1`
**Documentation**: https://docs.anthropic.com/claude/reference/
**Authentication**: API key in header `x-api-key`
#### 6.2.1 Messages Endpoint
**Endpoint**: `POST /messages` (relative to the base URL, which already includes `/v1`)
**Request Schema**:
```typescript
interface ClaudeRequest {
model: string; // "claude-opus-4-6" or "claude-sonnet-4-6"
max_tokens: number;
messages: {
role: "user" | "assistant";
content: string;
}[];
system?: string; // System prompt
temperature?: number; // 0.0 - 1.0
}
```
**Response Schema**:
```typescript
interface ClaudeResponse {
id: string;
type: "message";
role: "assistant";
content: {
type: "text";
text: string;
}[];
model: string;
stop_reason: "end_turn" | "max_tokens" | "stop_sequence";
usage: {
input_tokens: number;
output_tokens: number;
};
}
```
**Example Request**:
```bash
curl -X POST https://api.anthropic.com/v1/messages \
-H "x-api-key: ${ANTHROPIC_API_KEY}" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"max_tokens": 2048,
"messages": [{
"role": "user",
"content": "Summarize this paper abstract in 3 bullet points:\n\n{abstract}"
}]
}'
```
#### 6.2.2 Rate Limits
**Tier-based** (varies by subscription):
- Free tier: ~5 req/min
- Pro: ~50 req/min
- Enterprise: Custom
**Token Limits**:
- Opus 4.5: 200K input, 16K output
- Sonnet 4.5: 200K input, 16K output
**Cost** (as of Jan 2025):
- Opus 4.5: $15/MTok input, $75/MTok output
- Sonnet 4.5: $3/MTok input, $15/MTok output
#### 6.2.3 Summarization Prompt Templates
**Abstract Summary** (3-bullet format):
```typescript
const ABSTRACT_SUMMARY_PROMPT = `
Summarize this research paper abstract in exactly 3 bullet points.
Each bullet should be one concise sentence capturing key contributions.
Abstract:
{abstract}
Format:
- First key point
- Second key point
- Third key point
`.trim();
```
**Full Paper Summary** (with sections):
```typescript
const FULL_PAPER_SUMMARY_PROMPT = `
Summarize this research paper with the following structure:
1. Core Contribution (1 sentence)
2. Key Findings (3 bullet points)
3. Methodology (2-3 sentences)
4. Implications (2-3 sentences)
Paper text:
{paper_text}
`.trim();
```
**Concept Extraction**:
```typescript
const CONCEPT_EXTRACTION_PROMPT = `
Extract key concepts, methods, and terminology from this paper.
Return as JSON:
{
"core_concepts": ["concept1", "concept2", ...],
"methods": ["method1", "method2", ...],
"datasets": ["dataset1", ...],
"metrics": ["metric1", ...]
}
Paper abstract:
{abstract}
`.trim();
```
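The templates above use `{name}` placeholders but do not say how they are filled. One possible approach is sketched below; `fillTemplate` is an illustrative helper, not an existing function in the framework.

```typescript
// Replaces {name} placeholders with values. The pattern only matches a single
// word between braces, so the literal JSON braces in the concept-extraction
// template are left untouched.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (placeholder: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      throw new Error(`Missing value for placeholder ${placeholder}`);
    }
    return values[name];
  });
}

const prompt = fillTemplate("Abstract:\n{abstract}", {
  abstract: "We study chain-of-thought prompting.",
});
console.log(prompt);
```

Using a replacer function rather than a replacement string means characters such as `$1` inside an abstract are inserted literally, and substituted text is not scanned again for placeholders.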
### 6.3 RAG Pattern for Grounding
**Purpose**: Reduce hallucination and keep summaries faithful to the source text
**Implementation**:
```typescript
interface RAGContext {
paperTitle: string;
paperAbstract: string;
paperSections?: {
title: string;
content: string;
}[];
citationCount?: number;
year?: number;
}
function buildRAGPrompt(context: RAGContext, task: string): string {
return `
You are summarizing academic research. Stay strictly faithful to the source text.
Do not infer, speculate, or add information not present in the text.
Paper: ${context.paperTitle}
Year: ${context.year ?? "Unknown"}
Citations: ${context.citationCount ?? "Unknown"}
Abstract:
${context.paperAbstract}
${context.paperSections ? `
Full Text Sections:
${context.paperSections.map(s => `## ${s.title}\n${s.content}`).join("\n\n")}
` : ""}
Task: ${task}
Requirements:
- Use only information from the text above
- Cite specific claims if asked
- If information is unclear or missing, say so explicitly
`.trim();
}
```
### 6.4 OpenAI Fallback
**Base URL**: `https://api.openai.com/v1`
**Authentication**: `Authorization: Bearer {api_key}`
**Model**: `gpt-4-turbo` or `gpt-4o`
**Request Schema**:
```typescript
interface OpenAIRequest {
model: string;
messages: {
role: "system" | "user" | "assistant";
content: string;
}[];
max_tokens?: number;
temperature?: number;
}
```
**Rate Limits**: Tier-based (similar to Claude)
**Cost** (as of Jan 2025):
- GPT-4 Turbo: $10/MTok input, $30/MTok output
- GPT-4o: $5/MTok input, $15/MTok output
**Fallback Strategy**:
1. Primary: Claude Sonnet 4.5 (lower cost)
2. Fallback: OpenAI GPT-4o (if Claude unavailable/rate limited)
3. Ultimate fallback: Local model (if privacy required)
### 6.5 Token Management
**Estimation**:
```typescript
function estimateTokens(text: string): number {
// Rough estimate: 1 token ≈ 4 characters
return Math.ceil(text.length / 4);
}
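function truncateToTokenLimit(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  // Keep a 100-character margin; clamp at 0 so a small limit cannot produce a negative slice index.
  const keep = Math.max(0, maxChars - 100);
  return text.slice(0, keep) + "\n\n[Truncated...]";
}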
```
**Cost Tracking**:
```typescript
interface UsageMetrics {
inputTokens: number;
outputTokens: number;
estimatedCost: number;
model: string;
timestamp: string;
}
// Store in .aiwg/research/cache/llm/usage.jsonl (see the cache layout in section 7.1)
```
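The `estimatedCost` field can be derived from the token counts in the API response. A minimal sketch, assuming per-million-token prices are supplied by configuration; `ModelPrice` and `estimateCost` are illustrative names, and the rates shown simply reuse the Sonnet figures quoted in section 6.2.2.

```typescript
interface ModelPrice {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

function estimateCost(inputTokens: number, outputTokens: number, price: ModelPrice): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1e6;
}

// Illustrative rates; real prices should come from configuration.
const sonnetPrice: ModelPrice = { inputPerMTok: 3, outputPerMTok: 15 };

// 12,000 input tokens and 800 output tokens:
// (12000 * 3 + 800 * 15) / 1e6 = 0.048 USD
console.log(estimateCost(12000, 800, sonnetPrice).toFixed(4)); // "0.0480"
```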
### 6.6 Error Handling
**Common Errors**:
| Error | Action |
|-------|--------|
| Rate limit (429) | Exponential backoff, switch to fallback provider |
| Token limit exceeded | Truncate input, summarize in chunks |
| API key invalid (401) | Fail fast, prompt user to check config |
| Service unavailable (503) | Retry 3x with backoff, then fail gracefully |
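For the "summarize in chunks" action, the input has to be split so that each piece fits the budget. The sketch below packs paragraphs greedily using the same 4-characters-per-token heuristic as `estimateTokens`; `chunkByTokenBudget` is a hypothetical helper, and a production version would likely use a real tokenizer.

```typescript
// Same rough heuristic as estimateTokens: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function chunkByTokenBudget(text: string, maxTokens: number): string[] {
  if (maxTokens <= 0) throw new Error("maxTokens must be positive");
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    // Hard-split any single paragraph that is larger than the budget.
    for (let i = 0; i < paragraph.length; i += maxChars) {
      const piece = paragraph.slice(i, i + maxChars);
      const candidate = current ? `${current}\n\n${piece}` : piece;
      if (candidate.length <= maxChars) {
        current = candidate;
      } else {
        if (current) chunks.push(current);
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be summarized separately and the partial summaries combined in a final pass.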
## 7. Caching Strategy
### 7.1 Overview
**Purpose**: Minimize API calls, reduce costs, improve performance
**Storage Location**: `.aiwg/research/cache/`
**Structure**:
```
.aiwg/research/cache/
├── semantic-scholar/
│ ├── papers/ # {paper_id}.json
│ ├── citations/ # {paper_id}-citations.json
│ └── recommendations/ # {paper_id}-recs.json
├── crossref/
│ └── works/ # {doi_encoded}.json
├── arxiv/
│ └── papers/ # {arxiv_id}.json
├── unpaywall/
│ └── oa-status/ # {doi_encoded}.json
├── llm/
│ ├── summaries/ # {paper_id}-{hash}.json
│ └── usage.jsonl # Cost tracking
└── metadata.json # Cache metadata
```
### 7.2 Cache TTLs by Endpoint
| API | Endpoint | TTL | Rationale |
|-----|----------|-----|-----------|
| Semantic Scholar | Paper details | 7 days | Metadata rarely changes |
| Semantic Scholar | Citations | 1 day | Citation counts update frequently |
| Semantic Scholar | Recommendations | 7 days | Recommendations stable |
| CrossRef | Work metadata | 30 days | DOI metadata immutable |
| arXiv | Paper metadata | 30 days | arXiv IDs immutable |
| Unpaywall | OA status | 30 days | OA status rarely changes |
| LLM | Summaries | 90 days | Source text is fixed; regenerating costs tokens without adding information |
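The table can be expressed as configuration. This is a sketch only: the key format `<api>/<endpoint>` follows the directory names in section 7.1, and `CACHE_TTL_SECONDS` and `ttlFor` are illustrative names.

```typescript
const DAY_SECONDS = 24 * 60 * 60;

// Keys are "<api>/<endpoint>", mirroring the cache directory layout.
const CACHE_TTL_SECONDS: Record<string, number> = {
  "semantic-scholar/papers": 7 * DAY_SECONDS,
  "semantic-scholar/citations": 1 * DAY_SECONDS,
  "semantic-scholar/recommendations": 7 * DAY_SECONDS,
  "crossref/works": 30 * DAY_SECONDS,
  "arxiv/papers": 30 * DAY_SECONDS,
  "unpaywall/oa-status": 30 * DAY_SECONDS,
  "llm/summaries": 90 * DAY_SECONDS,
};

function ttlFor(api: string, endpoint: string): number {
  const ttl = CACHE_TTL_SECONDS[`${api}/${endpoint}`];
  // Fail loudly so a new endpoint cannot be cached without an explicit TTL.
  if (ttl === undefined) throw new Error(`No TTL configured for ${api}/${endpoint}`);
  return ttl;
}
```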
### 7.3 Cache Key Generation
```typescript
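import { createHash } from "node:crypto";
function generateCacheKey(api: string, endpoint: string, params: Record<string, unknown>): string {
  // Sort entries by key so equal parameter sets hash identically. (An array replacer
  // passed to JSON.stringify would also filter out nested keys, so it is avoided here.)
  const sorted = Object.fromEntries(
    Object.entries(params).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
  );
  const hash = createHash("sha256").update(JSON.stringify(sorted)).digest("hex").slice(0, 16);
  return `${api}/${endpoint}/${hash}.json`;
}
// Example: semantic-scholar/papers/a1b2c3d4e5f60718.json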
```
### 7.4 Cache Invalidation
**Manual Invalidation**:
```bash
# CLI command
aiwg research cache-clear --api semantic-scholar --older-than 7d
```
**Automatic Invalidation**:
```typescript
interface CacheEntry {
data: any;
timestamp: string;
ttl: number; // seconds
}
function isCacheValid(entry: CacheEntry): boolean {
const age = Date.now() - new Date(entry.timestamp).getTime();
return age < entry.ttl * 1000;
}
```
### 7.5 Cache Metadata
**`.aiwg/research/cache/metadata.json`**:
```json
{
"version": "1.0.0",
"lastCleanup": "2026-01-25T12:00:00Z",
"stats": {
"totalEntries": 1234,
"totalSizeBytes": 52428800,
"hitRate": 0.87,
"apis": {
"semantic-scholar": {
"entries": 800,
"sizeBytes": 35000000,
"hits": 1500,
"misses": 200
}
}
}
}
```
## 8. Error Handling
### 8.1 Unified Error Types
```typescript
enum ErrorType {
RATE_LIMIT = "RATE_LIMIT",
NOT_FOUND = "NOT_FOUND",
INVALID_REQUEST = "INVALID_REQUEST",
SERVICE_UNAVAILABLE = "SERVICE_UNAVAILABLE",
AUTHENTICATION = "AUTHENTICATION",
NETWORK = "NETWORK",
PARSE_ERROR = "PARSE_ERROR",
UNKNOWN = "UNKNOWN"
}
interface APIError {
type: ErrorType;
message: string;
api: string;
endpoint: string;
statusCode?: number;
retryable: boolean;
originalError?: Error;
}
```
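Each adapter needs to translate HTTP status codes into these unified types. A possible mapping is sketched below. To keep the block free of dependencies, a string-literal union stands in for the `ErrorType` enum, and `classifyStatus` is a hypothetical helper; treating every 5xx as `SERVICE_UNAVAILABLE` is an assumption, made because the enum has no separate server-error member.

```typescript
type ErrorKind =
  | "RATE_LIMIT"
  | "NOT_FOUND"
  | "INVALID_REQUEST"
  | "SERVICE_UNAVAILABLE"
  | "AUTHENTICATION"
  | "UNKNOWN";

function classifyStatus(status: number): { kind: ErrorKind; retryable: boolean } {
  if (status === 429) return { kind: "RATE_LIMIT", retryable: true };
  if (status === 404) return { kind: "NOT_FOUND", retryable: false };
  if (status === 401 || status === 403) return { kind: "AUTHENTICATION", retryable: false };
  // Assumption: all 5xx responses are treated as transient.
  if (status >= 500) return { kind: "SERVICE_UNAVAILABLE", retryable: true };
  if (status >= 400) return { kind: "INVALID_REQUEST", retryable: false };
  return { kind: "UNKNOWN", retryable: false };
}
```

The `retryable` flag lines up with the retryable status list in section 2.4: 429 and 5xx are retried, other client errors are not.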
### 8.2 Retry Policies
```typescript
interface RetryPolicy {
maxAttempts: number;
backoffMultiplier: number;
initialDelayMs: number;
maxDelayMs: number;
retryableErrors: ErrorType[];
}
const DEFAULT_RETRY_POLICY: RetryPolicy = {
maxAttempts: 3,
backoffMultiplier: 2,
initialDelayMs: 1000,
maxDelayMs: 30000,
retryableErrors: [
ErrorType.RATE_LIMIT,
ErrorType.SERVICE_UNAVAILABLE,
ErrorType.NETWORK
]
};
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
async function retryWithBackoff<T>(
fn: () => Promise<T>,
policy: RetryPolicy = DEFAULT_RETRY_POLICY
): Promise<T> {
let lastError: APIError;
for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as APIError;
if (!policy.retryableErrors.includes(lastError.type)) {
throw error;
}
if (attempt < policy.maxAttempts) {
const delay = Math.min(
policy.initialDelayMs * Math.pow(policy.backoffMultiplier, attempt - 1),
policy.maxDelayMs
);
await sleep(delay);
}
}
}
throw lastError!;
}
```
### 8.3 Fallback Strategies
**Multi-Provider Fallback**:
```typescript
async function fetchPaperMetadata(doi: string): Promise<PaperMetadata> {
  const strategies = [
    () => semanticScholarClient.getPaperByDOI(doi),
    () => crossRefClient.getWork(doi),
    () => cache.get(`fallback/${doi}`) // Stale cache as last resort
  ];
  let lastError: Error | undefined;
  for (const strategy of strategies) {
    try {
      const result = await strategy();
      // A cache miss may resolve to nothing instead of throwing; treat that as a failed strategy.
      if (result != null) return result;
    } catch (error) {
      lastError = error as Error;
      console.warn(`Strategy failed: ${lastError.message}`);
    }
  }
  throw new Error(
    `All fallback strategies exhausted for DOI: ${doi}` +
      (lastError ? ` (last error: ${lastError.message})` : "")
  );
}
```
**Graceful Degradation**:
```typescript
interface EnrichedPaper {
core: PaperMetadata; // Required
citations?: Citation[]; // Optional
recommendations?: Paper[]; // Optional
summary?: Summary; // Optional
oaStatus?: OAStatus; // Optional
}
async function enrichPaper(paper: PaperMetadata): Promise<EnrichedPaper> {
  const enriched: EnrichedPaper = { core: paper };
  // Add optional enrichments, continue on failure.
  // Catch variables are `unknown` under strict settings, so narrow before reading `.message`.
  try {
    enriched.citations = await getCitations(paper.id);
  } catch (error) {
    console.warn(`Failed to fetch citations: ${(error as Error).message}`);
  }
  try {
    enriched.recommendations = await getRecommendations(paper.id);
  } catch (error) {
    console.warn(`Failed to fetch recommendations: ${(error as Error).message}`);
  }
  // ... etc
  return enriched;
}
```
### 8.4 User-Facing Error Messages
```typescript
function formatUserError(error: APIError): string {
switch (error.type) {
case ErrorType.RATE_LIMIT:
return `Rate limit exceeded for ${error.api}. Please wait and try again.`;
case ErrorType.NOT_FOUND:
return `Resource not found. Please check the paper ID or DOI.`;
case ErrorType.AUTHENTICATION:
return `Authentication failed for ${error.api}. Please check your API key in config.`;
case ErrorType.SERVICE_UNAVAILABLE:
return `${error.api} is temporarily unavailable. Trying fallback...`;
default:
return `An error occurred: ${error.message}`;
}
}
```
## 9. Security Considerations
### 9.1 API Key Storage
**NEVER hard-code API keys in source code.**
**Recommended Approaches**:
1. **Environment Variables** (for development):
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export SEMANTIC_SCHOLAR_API_KEY="..."
```
2. **Config File** (for deployment):
```yaml
# .aiwg/research/config.yml (add to .gitignore!)
apis:
semantic_scholar:
api_key: "..."
anthropic:
api_key: "sk-ant-..."
openai:
api_key: "sk-..."
```
3. **Secure Token Load Pattern**:
```bash
# Load from secure file (mode 600)
bash <<'EOF'
API_KEY=$(cat ~/.config/aiwg/semantic-scholar-key)
curl -H "x-api-key: ${API_KEY}" "..."
EOF
```
**Validation**:
```typescript
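function loadAPIKey(provider: string): string {
  // Normalize names such as "semantic-scholar" into a valid variable name (SEMANTIC_SCHOLAR_API_KEY).
  const envName = `${provider.toUpperCase().replace(/[^A-Z0-9]+/g, "_")}_API_KEY`;
  const key = process.env[envName];
  if (!key) {
    throw new Error(
      `API key for ${provider} not found. Set the ${envName} environment variable.`
    );
  }
  return key;
}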
```
### 9.2 Request Sanitization
**Input Validation**:
```typescript
function sanitizeSearchQuery(query: string): string {
// Remove control characters
let sanitized = query.replace(/[\x00-\x1F\x7F-\x9F]/g, "");
// Limit length
if (sanitized.length > 500) {
sanitized = sanitized.slice(0, 500);
}
// URL encode for API
return encodeURIComponent(sanitized);
}
```
**DOI Validation**:
```typescript
function isValidDOI(doi: string): boolean {
// DOI format: 10.XXXX/...
return /^10\.\d{4,}(\.\d+)*\/[^\s]+$/.test(doi);
}
```
**ArXiv ID Validation**:
```typescript
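function isValidArXivID(id: string): boolean {
  // Old format: archive/YYMMNNN, optionally with a subject class and version
  //             (e.g., hep-th/9901001, math.GT/0309136v2)
  // New format: YYMM.NNNN or YYMM.NNNNN, optionally with a version (e.g., 1706.03762v7)
  return /^([a-z-]+(\.[A-Z]{2})?\/\d{7}|\d{4}\.\d{4,5})(v\d+)?$/.test(id);
}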
```
### 9.3 Rate Limit Compliance
**Token Bucket Implementation**:
```typescript
class RateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number,
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
this.refill();
while (this.tokens < 1) {
await sleep(100);
this.refill();
}
this.tokens -= 1;
}
private refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(
this.maxTokens,
this.tokens + elapsed * this.refillRate
);
this.lastRefill = now;
}
}
// Usage (`sleep(ms)` is assumed to be a promise-based delay helper)
const s2Limiter = new RateLimiter(100, 100); // 100 req/sec
await s2Limiter.acquire();
// ... make request
```
### 9.4 Data Privacy
**Sensitive Data**:
- User emails (for Unpaywall, CrossRef)
- Search queries (may contain proprietary info)
- Downloaded PDFs (copyright considerations)
**Recommendations**:
1. **Do not log** API keys, full queries, or user emails
2. **Redact** sensitive fields in logs:
```typescript
function redactForLogging(url: string): string {
return url.replace(/([?&]email=)[^&]+/, "$1REDACTED");
}
```
3. **Respect copyright**: Only download OA papers, check licenses
4. **Cache carefully**: Ensure cached data doesn't violate API ToS
### 9.5 HTTPS Enforcement
**Always use HTTPS** for API calls:
```typescript
function validateAPIURL(url: string): void {
if (!url.startsWith("https://")) {
throw new Error("API URLs must use HTTPS");
}
}
```
**Exception**: The arXiv API has traditionally been served over HTTP and is transitioning to HTTPS. Support both, preferring HTTPS:
```typescript
const ARXIV_BASE_URL = "https://export.arxiv.org/api/query"; // Prefer HTTPS
// Fallback to HTTP if HTTPS unavailable
```
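A prefix check and a per-host exception can be combined in one validator. The sketch below parses the URL rather than comparing strings; `assertAllowedApiUrl` and `HTTP_ALLOWED_HOSTS` are hypothetical names, and the allowlist content is an assumption based on the arXiv exception described above.

```typescript
// Hosts that may be reached over plain HTTP; an assumption for this sketch.
const HTTP_ALLOWED_HOSTS = new Set(["export.arxiv.org"]);

function assertAllowedApiUrl(rawUrl: string): URL {
  const url = new URL(rawUrl); // throws TypeError on malformed input
  if (url.protocol === "https:") return url;
  if (url.protocol === "http:" && HTTP_ALLOWED_HOSTS.has(url.hostname)) return url;
  throw new Error(`API URLs must use HTTPS: ${url.origin}`);
}

assertAllowedApiUrl("https://api.crossref.org/works"); // ok
assertAllowedApiUrl("http://export.arxiv.org/api/query"); // ok (allowlisted)
```

Parsing first also rejects malformed URLs early, before any request is attempted.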
## 10. Testing Approach
### 10.1 Mock Endpoints
**Strategy**: Use mock servers for unit/integration tests to avoid hitting real APIs
**Tools**:
- `nock` (Node.js HTTP mocking)
- `msw` (Mock Service Worker)
**Example with nock**:
```typescript
import nock from "nock";
describe("SemanticScholarClient", () => {
beforeEach(() => {
nock("https://api.semanticscholar.org")
.get("/graph/v1/paper/search")
.query({ query: "transformer", limit: 10 })
.reply(200, {
total: 1,
offset: 0,
data: [{
paperId: "abc123",
title: "Attention Is All You Need",
abstract: "The dominant sequence transduction models...",
year: 2017,
citationCount: 50000
}]
});
});
it("should fetch papers by search query", async () => {
const client = new SemanticScholarClient();
const results = await client.search("transformer", { limit: 10 });
expect(results.data).toHaveLength(1);
expect(results.data[0].title).toBe("Attention Is All You Need");
});
});
```
### 10.2 Integration Test Fixtures
**Location**: `test/fixtures/api-responses/`
**Structure**:
```
test/fixtures/api-responses/
├── semantic-scholar/
│ ├── paper-search-transformers.json
│ ├── paper-details-1706.03762.json
│ └── citations-abc123.json
├── crossref/
│ └── work-10.1145.3491102.3517582.json
├── arxiv/
│ └── search-chain-of-thought.xml
└── unpaywall/
└── oa-status-10.1038.nature12373.json
```
**Loading Fixtures**:
```typescript
import fs from "fs/promises";
import path from "path";
async function loadFixture(name: string): Promise<any> {
const filePath = path.join(__dirname, "../fixtures/api-responses", name);
const content = await fs.readFile(filePath, "utf-8");
return name.endsWith(".xml") ? content : JSON.parse(content);
}
// Usage in tests
const mockResponse = await loadFixture("semantic-scholar/paper-search-transformers.json");
```
### 10.3 Contract Testing
**Purpose**: Ensure our code matches actual API schemas
**Approach**: Record real API responses, validate against schemas
**Example with Zod**:
```typescript
import { z } from "zod";
const PaperSchema = z.object({
paperId: z.string(),
title: z.string(),
abstract: z.string().optional(),
year: z.number().optional(),
citationCount: z.number().optional(),
authors: z.array(z.object({
authorId: z.string(),
name: z.string()
})).optional()
});
describe("Semantic Scholar API contract", () => {
it("should match expected schema", async () => {
const response = await fetch("https://api.semanticscholar.org/graph/v1/paper/abc123");
const data = await response.json();
// Throws if schema doesn't match
const parsed = PaperSchema.parse(data);
expect(parsed.paperId).toBe("abc123");
});
});
```
### 10.4 Rate Limit Testing
**Mock rate limit headers**:
```typescript
nock("https://api.semanticscholar.org")
.get("/graph/v1/paper/search")
.reply(429, { error: "Rate limit exceeded" }, {
"X-RateLimit-Limit": "100",
"X-RateLimit-Remaining": "0",
"X-RateLimit-Reset": String(Math.floor(Date.now() / 1000) + 5) // epoch seconds, as in the header example in section 2.3
});
```
**Test retry logic**:
```typescript
it("should retry on 429 with exponential backoff", async () => {
nock("https://api.semanticscholar.org")
.get("/graph/v1/paper/search")
.reply(429)
.get("/graph/v1/paper/search")
.reply(200, { data: [] });
const client = new SemanticScholarClient();
const start = Date.now();
await client.search("test");
const elapsed = Date.now() - start;
// Should have waited ~1s before retry
expect(elapsed).toBeGreaterThan(900);
});
```
### 10.5 Error Scenario Coverage
**Test matrix**:
| Scenario | HTTP Code | Expected Behavior |
|----------|-----------|-------------------|
| Success | 200 | Return parsed data |
| Not found | 404 | Throw NOT_FOUND error |
| Rate limit | 429 | Retry with backoff |
| Server error | 500 | Retry 3x, then throw |
| Network timeout | - | Retry 3x, then throw NETWORK error |
| Invalid JSON | 200 | Throw PARSE_ERROR |
**Example test**:
```typescript
describe("Error handling", () => {
it("should throw NOT_FOUND on 404", async () => {
nock("https://api.semanticscholar.org")
.get("/graph/v1/paper/invalid")
.reply(404);
const client = new SemanticScholarClient();
await expect(client.getPaper("invalid")).rejects.toThrow("NOT_FOUND");
});
});
```
## References
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/solution-profile.md - Overall research framework design
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/use-cases.md - Research workflow use cases
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/data-model.md - Paper metadata schema
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/rules/token-security.md - Secure token handling patterns
- [Semantic Scholar API Docs](https://api.semanticscholar.org/api-docs/)
- [CrossRef API Docs](https://api.crossref.org/swagger-ui/)
- [arXiv API Docs](https://arxiv.org/help/api/)
- [Unpaywall API Docs](https://unpaywall.org/products/api)
- [Anthropic API Docs](https://docs.anthropic.com/claude/reference/)
- [OpenAI API Docs](https://platform.openai.com/docs/api-reference)
**Document Status**: Draft
**Next Steps**:
1. Review with System Analyst and Architecture Designer
2. Validate against research workflow use cases
3. Create TypeScript interfaces in codebase
4. Implement adapter layer with caching
5. Write integration tests with fixtures
6. Document configuration setup in README
**Metadata**:
- **Created**: 2026-01-25
- **Author**: API Designer (Claude Code)
- **Phase**: Elaboration
- **Artifact Type**: Technical Specification
- **Related Use Cases**: UC-001, UC-002, UC-003, UC-004
- **Related Architecture**: @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/data-model.md