aiwg
Version:
Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.
584 lines (449 loc) • 15 kB
Markdown
# Research PDF Optimization Guide
**Version**: 1.0.0
**Last Updated**: 2026-02-06
**Issue**: #290
## Overview
Claude Code v2.1.30 introduced PDF page range support in the Read tool, enabling dramatic reductions in context consumption for research workflows. Large PDFs (>10 pages) now return lightweight references when @-mentioned, with full content available via targeted page range reads.
This guide documents optimization patterns for AIWG research tools to leverage this capability.
## PDF Page Range Support
### Read Tool Enhancement
The Read tool now accepts a `pages` parameter for targeted PDF reading:
```markdown
# Read just the abstract
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="1"
# Read methodology section
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="4-8"
# Verify a specific citation
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="12"
# Read multiple non-contiguous pages
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="1,5,10"
# Read last few pages
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="-2" # Last 2 pages
```
### Page Range Format
| Format | Description | Example |
|--------|-------------|---------|
| `"N"` | Single page | `"5"` |
| `"N-M"` | Contiguous range | `"3-7"` |
| `"N,M,O"` | Non-contiguous pages | `"1,5,10"` |
| `"-N"` | Last N pages | `"-3"` (last 3 pages) |
### Lightweight References
When large PDFs (>10 pages) are @-mentioned:
**Before (full context)**:
```
@.aiwg/research/sources/REF-015.pdf → Full 15-page paper loaded (15,000+ tokens)
```
**After (lightweight reference)**:
```
@.aiwg/research/sources/REF-015.pdf → Metadata + page count only (500 tokens)
"To read full content, use: Read file_path='...' pages='1-15'"
```
This enables mentioning many PDFs in context for reference without exhausting token budget.
## Standard Academic Paper Structure
### Typical Page Mapping
Academic papers follow predictable structures. Use these ranges as starting points:
| Section | Typical Pages | Purpose |
|---------|---------------|---------|
| **Abstract** | 1 | High-level summary, key findings |
| **Introduction** | 1-3 | Problem statement, motivation |
| **Related Work** | 3-5 | Literature review, positioning |
| **Methodology** | 4-8 | Approach, experimental design |
| **Results** | 7-10 | Findings, data, analysis |
| **Discussion** | 9-12 | Interpretation, implications |
| **Conclusion** | -2 to -1 | Summary, future work |
| **References** | -4 to -1 | Bibliography |
### Variations by Venue
**Conference Papers (6-8 pages)**:
- Abstract: page 1
- Introduction + Related Work: pages 1-3
- Methodology: pages 3-5
- Results + Discussion: pages 5-7
- Conclusion: page 7-8
**Journal Articles (10-15 pages)**:
- Abstract: page 1
- Introduction: pages 1-2
- Related Work: pages 2-4
- Methodology: pages 4-7
- Results: pages 7-10
- Discussion: pages 10-12
- Conclusion: pages 12-13
- References: pages 13-15
**Technical Reports (variable)**:
- Executive Summary: pages 1-2
- Main Content: varies widely
- Appendices: often extensive
## AIWG Research Tool Optimization
### /citation-check
**Purpose**: Verify citations reference correct page numbers and content.
**Optimization Pattern**:
```yaml
before:
- Read full PDF (all pages)
- Extract all potential citations
- Token cost: 15,000+ per paper
after:
- Read only cited pages from frontmatter
- Example: pages="12,15,23" for three citations
- Token cost: ~500-1000 per paper (90%+ reduction)
```
**Implementation**:
```markdown
# Extract cited pages from frontmatter
cited_pages: [12, 15, 23]
# Read only those pages
for page in cited_pages:
Read file_path=".aiwg/research/sources/REF-XXX.pdf" pages="{page}"
```
### /verify-citations
**Purpose**: Validate citation accuracy and completeness.
**Optimization Pattern**:
```yaml
workflow:
1. Parse frontmatter for citation metadata
2. Extract page numbers from key_findings
3. Read only specified pages
4. Verify quoted text matches source
5. Check page numbers are accurate
example:
finding: "System achieves 34% improvement (p. 7)"
pages_to_read: "7"
verification: Read file_path="..." pages="7"
```
### /quality-assess (GRADE)
**Purpose**: Assess research quality for GRADE baseline.
**Optimization Pattern**:
```yaml
quality_assessment_pages:
study_design: pages="1,3-5" # Abstract + methodology
sample_size: pages="4-6" # Usually in methods
bias_assessment: pages="4-8" # Methods + results
consistency: pages="7-10" # Results section
directness: pages="1,10-12" # Abstract + discussion
token_savings: ~80% vs full-paper read
```
**Implementation**:
```markdown
# Phase 1: Quick scan (abstract only)
Read file_path="REF-XXX.pdf" pages="1"
→ Determine study type, assess baseline quality
# Phase 2: Methodology deep-dive (if needed)
Read file_path="REF-XXX.pdf" pages="4-8"
→ Assess bias, sample size, validity
# Phase 3: Results verification (if needed)
Read file_path="REF-XXX.pdf" pages="7-10"
→ Check consistency, precision
```
### /corpus-health
**Purpose**: Scan entire corpus for metadata quality and completeness.
**Optimization Pattern**:
```yaml
before:
- Load all papers fully
- Token cost: 500,000+ for 50 papers
- Context overflow common
after:
- Read abstracts only (page 1)
- Token cost: ~25,000 for 50 papers (95% reduction)
- Detect issues: missing metadata, poor quality, outdated
```
**Implementation**:
```markdown
for each paper in corpus:
Read file_path="paper.pdf" pages="1"
extract:
- title
- authors
- publication_year
- key_findings_summary
compare_with_frontmatter()
```
### /grade-report
**Purpose**: Generate comprehensive GRADE evidence profiles.
**Optimization Pattern**:
```yaml
tiered_reading:
tier_1_abstract:
pages: "1"
purpose: "Quick classification"
tier_2_methods:
pages: "4-8"
purpose: "Bias and quality assessment"
triggered_by: "tier_1 indicates HIGH potential"
tier_3_results:
pages: "7-10"
purpose: "Precision and consistency"
triggered_by: "tier_2 confirms HIGH quality"
tier_4_full:
pages: "all"
purpose: "Comprehensive analysis"
triggered_by: "contradictory findings or unclear quality"
```
### Research Synthesis Tools
**Purpose**: Compare findings across multiple papers.
**Optimization Pattern**:
```yaml
comparative_analysis:
step_1_abstracts:
pages: "1"
for_all_papers: true
purpose: "Identify relevant papers"
step_2_results:
pages: "7-10"
for_selected_papers: true
purpose: "Extract comparable metrics"
step_3_methodology:
pages: "4-8"
for_conflicting_papers: true
purpose: "Resolve discrepancies"
token_budget:
abstracts_50_papers: 25,000 tokens
results_10_papers: 15,000 tokens
methods_3_papers: 6,000 tokens
total: 46,000 tokens (vs 750,000 full read)
```
## Usage Examples
### Example 1: Citation Verification
```markdown
# Scenario: Verify claim from REF-015 page 12
## Step 1: Read frontmatter to get claim
Read file_path=".aiwg/research/findings/REF-015.md"
## Step 2: Extract cited page
Claim: "Self-Refine improves quality by 20% (p. 12)"
Cited page: 12
## Step 3: Read only that page
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="12"
## Step 4: Verify quote accuracy
Search for "20%" or "improvement" in page 12 content
```
### Example 2: Quality Assessment (GRADE)
```markdown
# Scenario: Assess REF-042 for GRADE baseline
## Phase 1: Quick scan (abstract)
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="1"
→ Study type: Randomized Controlled Trial → Baseline: HIGH
## Phase 2: Methodology check
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="4-8"
→ Sample size: 120 participants
→ Randomization: proper
→ Blinding: double-blind
→ Confirmed: HIGH quality
## Phase 3: Results consistency
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="8-10"
→ Confidence intervals reported
→ Effect size: significant (p<0.01)
→ No inconsistencies detected
## Final: HIGH quality (no downgrades)
```
### Example 3: Bulk Corpus Scan
```markdown
# Scenario: Check 50 papers for missing authors in frontmatter
## For each paper:
for ref in REF-001 through REF-050:
# Read lightweight reference
@.aiwg/research/sources/{ref}.pdf
# Read abstract only
Read file_path=".aiwg/research/sources/{ref}.pdf" pages="1"
# Extract authors from page 1
authors = extract_authors(page_1_content)
# Compare with frontmatter
frontmatter_authors = load_frontmatter_authors(ref)
if authors != frontmatter_authors:
flag_discrepancy(ref)
## Token cost: ~25,000 (vs 750,000 for full reads)
```
### Example 4: Multi-Paper Synthesis
```markdown
# Scenario: Compare TDD effectiveness across 5 studies
## Step 1: Read all abstracts
Read file_path="REF-010.pdf" pages="1" # TDD study 1
Read file_path="REF-023.pdf" pages="1" # TDD study 2
Read file_path="REF-035.pdf" pages="1" # TDD study 3
Read file_path="REF-041.pdf" pages="1" # TDD study 4
Read file_path="REF-052.pdf" pages="1" # TDD study 5
## Step 2: Identify key metrics pages
Study 1: results on page 9
Study 2: results on page 12
Study 3: results on page 8
Study 4: results on page 10
Study 5: results on page 7
## Step 3: Read results sections
Read file_path="REF-010.pdf" pages="9"
Read file_path="REF-023.pdf" pages="12"
Read file_path="REF-035.pdf" pages="8"
Read file_path="REF-041.pdf" pages="10"
Read file_path="REF-052.pdf" pages="7"
## Step 4: Synthesize findings
Effect sizes: 15%, 22%, 18%, 30%, 12%
Mean: 19.4%
Range: 12-30%
## Token cost: ~6,000 (vs 75,000 for full papers)
```
## Best Practices
### 1. Always Start with Abstracts
```markdown
# Good: Progressive disclosure
Read file_path="paper.pdf" pages="1" # Abstract first
if relevant:
Read file_path="paper.pdf" pages="4-8" # Then methodology
if still_needed:
Read file_path="paper.pdf" pages="all" # Full paper last
# Bad: Load everything upfront
Read file_path="paper.pdf" pages="all" # Wastes tokens if not relevant
```
### 2. Use Frontmatter to Guide Page Selection
```yaml
# Store key page numbers in frontmatter
key_findings:
- finding: "34% improvement"
metric: "+34% accuracy"
page: 7 # ← Use this for targeted reads
impact: high
- finding: "Sample size n=500"
page: 5 # ← Methodology section
impact: high
```
### 3. Batch Related Page Reads
```markdown
# Good: Single read for contiguous sections
Read file_path="paper.pdf" pages="4-8" # Methods + results together
# Less efficient: Multiple reads
Read file_path="paper.pdf" pages="4"
Read file_path="paper.pdf" pages="5"
Read file_path="paper.pdf" pages="6"
Read file_path="paper.pdf" pages="7"
Read file_path="paper.pdf" pages="8"
```
### 4. Document Page Mappings in Corpus
```yaml
# Add to research corpus metadata
paper_structure:
abstract: 1
introduction: 2-3
methodology: 4-6
results: 7-9
discussion: 9-11
conclusion: 11-12
references: 12-14
# Enables automation:
Read file_path="paper.pdf" pages="{paper_structure.methodology}"
```
## Cross-Platform Compatibility
### Platform Support Matrix
| Platform | PDF Page Ranges | Fallback Behavior |
|----------|-----------------|-------------------|
| Claude Code v2.1.30+ | ✅ Supported | N/A |
| Claude Code <v2.1.30 | ❌ Not supported | Full PDF read |
| GitHub Copilot | ❌ Not supported | Full PDF read |
| Cursor | ❌ Not supported | Full PDF read |
| Factory AI | ❌ Not supported | Full PDF read |
### Graceful Degradation
Research tools MUST handle both modes:
```typescript
// Detect page range support
function readPdfPages(path: string, pages?: string): string {
try {
// Attempt page range read
return readTool({ file_path: path, pages });
} catch (error) {
if (error.message.includes('pages parameter not supported')) {
// Fall back to full read
console.warn('Page ranges not supported, reading full PDF');
return readTool({ file_path: path });
}
throw error;
}
}
```
### Detection Method
```markdown
# Test for page range support
try:
Read file_path="test.pdf" pages="1"
PAGE_RANGES_SUPPORTED = true
except:
PAGE_RANGES_SUPPORTED = false
```
## Integration with Research Metadata Rules
### Enhanced Frontmatter
```yaml
ref_id: "REF-015"
title: "Self-Refine: Iterative Refinement with Self-Feedback"
pdf_hash: "a1b2c3d4..."
# NEW: Page-level metadata
page_structure:
abstract: 1
methodology: 4-7
results: 7-10
key_tables:
- page: 8
caption: "Performance comparison"
- page: 9
caption: "Ablation study"
# NEW: Citation page references
key_findings:
- finding: "20% improvement over baseline"
metric: "+20% accuracy"
page: 9 # ← Enables targeted verification
impact: high
- finding: "94% of failures due to bad feedback"
metric: "94% attribution"
page: 12 # ← Direct citation verification
impact: critical
```
### Validation Updates
```yaml
validation_checklist:
# Existing checks
- pdf_hash_recorded
- doi_verified
- frontmatter_complete
# NEW: Page-level checks
- page_structure_documented
- key_findings_have_pages
- citation_pages_verified
```
## Performance Comparison
### Token Consumption Analysis
| Operation | Before (Full Read) | After (Page Ranges) | Savings |
|-----------|-------------------|---------------------|---------|
| Single paper citation check | 15,000 tokens | 500 tokens | 97% |
| GRADE assessment (5 papers) | 75,000 tokens | 10,000 tokens | 87% |
| Corpus health scan (50 papers) | 750,000 tokens | 25,000 tokens | 97% |
| Synthesis (10 papers) | 150,000 tokens | 20,000 tokens | 87% |
### Real-World Example
**Scenario**: Verify 20 citations across 15 papers
**Before**:
```
Load 15 full PDFs: 15 × 10,000 = 150,000 tokens
→ Often exceeds context window
→ Requires multiple sessions
```
**After**:
```
Load 20 specific pages: 20 × 500 = 10,000 tokens
→ Single session
→ 93% token reduction
```
## Migration Checklist
When updating research tools to use page ranges:
- [ ] Test for page range support before using
- [ ] Implement graceful fallback to full reads
- [ ] Update frontmatter schema to include page structure
- [ ] Document page mappings for key papers
- [ ] Add page numbers to key_findings
- [ ] Update citation verification to use targeted reads
- [ ] Refactor corpus health checks to use abstracts only
- [ ] Add page range examples to tool documentation
- [ ] Test cross-platform compatibility
- [ ] Update token budget calculations
## References
- @.claude/rules/research-metadata.md - Research metadata requirements
- @docs/cli-reference.md - AIWG CLI commands
- @.aiwg/research/docs/grade-assessment-guide.md - GRADE methodology
- Claude Code v2.1.30 Release Notes - PDF page range feature
**Status**: ACTIVE
**Version**: 1.0.0
**Issue**: #290