# Research PDF Optimization Guide

**Version**: 1.0.0
**Last Updated**: 2026-02-06
**Issue**: #290

## Overview

Claude Code v2.1.30 introduced PDF page range support in the Read tool, enabling dramatic reductions in context consumption for research workflows. Large PDFs (>10 pages) now return lightweight references when @-mentioned, with full content available via targeted page range reads.

This guide documents optimization patterns that let AIWG research tools take advantage of this capability.

## PDF Page Range Support

### Read Tool Enhancement

The Read tool now accepts a `pages` parameter for targeted PDF reading:

```markdown
# Read just the abstract
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="1"

# Read methodology section
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="4-8"

# Verify a specific citation
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="12"

# Read multiple non-contiguous pages
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="1,5,10"

# Read last few pages
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="-2"  # Last 2 pages
```

### Page Range Format

| Format | Description | Example |
|--------|-------------|---------|
| `"N"` | Single page | `"5"` |
| `"N-M"` | Contiguous range | `"3-7"` |
| `"N,M,O"` | Non-contiguous pages | `"1,5,10"` |
| `"-N"` | Last N pages | `"-3"` (last 3 pages) |

### Lightweight References

When large PDFs (>10 pages) are @-mentioned:

**Before (full context)**:
```
@.aiwg/research/sources/REF-015.pdf
Full 15-page paper loaded (15,000+ tokens)
```

**After (lightweight reference)**:
```
@.aiwg/research/sources/REF-015.pdf
Metadata + page count only (500 tokens)
"To read full content, use: Read file_path='...' pages='1-15'"
```

This makes it possible to mention many PDFs in context for reference without exhausting the token budget.

## Standard Academic Paper Structure

### Typical Page Mapping

Academic papers follow predictable structures.
Use these ranges as starting points:

| Section | Typical Pages | Purpose |
|---------|---------------|---------|
| **Abstract** | 1 | High-level summary, key findings |
| **Introduction** | 1-3 | Problem statement, motivation |
| **Related Work** | 3-5 | Literature review, positioning |
| **Methodology** | 4-8 | Approach, experimental design |
| **Results** | 7-10 | Findings, data, analysis |
| **Discussion** | 9-12 | Interpretation, implications |
| **Conclusion** | -2 to -1 | Summary, future work |
| **References** | -4 to -1 | Bibliography |

### Variations by Venue

**Conference Papers (6-8 pages)**:
- Abstract: page 1
- Introduction + Related Work: pages 1-3
- Methodology: pages 3-5
- Results + Discussion: pages 5-7
- Conclusion: pages 7-8

**Journal Articles (10-15 pages)**:
- Abstract: page 1
- Introduction: pages 1-2
- Related Work: pages 2-4
- Methodology: pages 4-7
- Results: pages 7-10
- Discussion: pages 10-12
- Conclusion: pages 12-13
- References: pages 13-15

**Technical Reports (variable)**:
- Executive Summary: pages 1-2
- Main Content: varies widely
- Appendices: often extensive

## AIWG Research Tool Optimization

### /citation-check

**Purpose**: Verify that citations reference the correct page numbers and content.

**Optimization Pattern**:

```yaml
before:
  - Read full PDF (all pages)
  - Extract all potential citations
  - Token cost: 15,000+ per paper

after:
  - Read only cited pages from frontmatter
  - Example: pages="12,15,23" for three citations
  - Token cost: ~500-1000 per paper (90%+ reduction)
```

**Implementation**:

```markdown
# Extract cited pages from frontmatter
cited_pages: [12, 15, 23]

# Read only those pages
for page in cited_pages:
  Read file_path=".aiwg/research/sources/REF-XXX.pdf" pages="{page}"
```

### /verify-citations

**Purpose**: Validate citation accuracy and completeness.

**Optimization Pattern**:

```yaml
workflow:
  1. Parse frontmatter for citation metadata
  2. Extract page numbers from key_findings
  3. Read only specified pages
  4. Verify quoted text matches source
  5. Check page numbers are accurate

example:
  finding: "System achieves 34% improvement (p. 7)"
  pages_to_read: "7"
  verification: Read file_path="..." pages="7"
```

### /quality-assess (GRADE)

**Purpose**: Assess research quality for GRADE baseline.

**Optimization Pattern**:

```yaml
quality_assessment_pages:
  study_design: pages="1,3-5"      # Abstract + methodology
  sample_size: pages="4-6"         # Usually in methods
  bias_assessment: pages="4-8"     # Methods + results
  consistency: pages="7-10"        # Results section
  directness: pages="1,10-12"      # Abstract + discussion

token_savings: ~80% vs full-paper read
```

**Implementation**:

```markdown
# Phase 1: Quick scan (abstract only)
Read file_path="REF-XXX.pdf" pages="1"
Determine study type, assess baseline quality

# Phase 2: Methodology deep-dive (if needed)
Read file_path="REF-XXX.pdf" pages="4-8"
Assess bias, sample size, validity

# Phase 3: Results verification (if needed)
Read file_path="REF-XXX.pdf" pages="7-10"
Check consistency, precision
```

### /corpus-health

**Purpose**: Scan the entire corpus for metadata quality and completeness.

**Optimization Pattern**:

```yaml
before:
  - Load all papers fully
  - Token cost: 500,000+ for 50 papers
  - Context overflow common

after:
  - Read abstracts only (page 1)
  - Token cost: ~25,000 for 50 papers (95% reduction)
  - Detect issues: missing metadata, poor quality, outdated
```

**Implementation**:

```markdown
for each paper in corpus:
  Read file_path="paper.pdf" pages="1"
  extract:
    - title
    - authors
    - publication_year
    - key_findings_summary
  compare_with_frontmatter()
```

### /grade-report

**Purpose**: Generate comprehensive GRADE evidence profiles.


**Optimization Pattern**:

```yaml
tiered_reading:
  tier_1_abstract:
    pages: "1"
    purpose: "Quick classification"

  tier_2_methods:
    pages: "4-8"
    purpose: "Bias and quality assessment"
    triggered_by: "tier_1 indicates HIGH potential"

  tier_3_results:
    pages: "7-10"
    purpose: "Precision and consistency"
    triggered_by: "tier_2 confirms HIGH quality"

  tier_4_full:
    pages: "all"
    purpose: "Comprehensive analysis"
    triggered_by: "contradictory findings or unclear quality"
```

### Research Synthesis Tools

**Purpose**: Compare findings across multiple papers.

**Optimization Pattern**:

```yaml
comparative_analysis:
  step_1_abstracts:
    pages: "1"
    for_all_papers: true
    purpose: "Identify relevant papers"

  step_2_results:
    pages: "7-10"
    for_selected_papers: true
    purpose: "Extract comparable metrics"

  step_3_methodology:
    pages: "4-8"
    for_conflicting_papers: true
    purpose: "Resolve discrepancies"

token_budget:
  abstracts_50_papers: 25,000 tokens
  results_10_papers: 15,000 tokens
  methods_3_papers: 6,000 tokens
  total: 46,000 tokens (vs 750,000 full read)
```

## Usage Examples

### Example 1: Citation Verification

```markdown
# Scenario: Verify claim from REF-015 page 12

## Step 1: Read frontmatter to get claim
Read file_path=".aiwg/research/findings/REF-015.md"

## Step 2: Extract cited page
Claim: "Self-Refine improves quality by 20% (p. 12)"
Cited page: 12

## Step 3: Read only that page
Read file_path=".aiwg/research/sources/REF-015.pdf" pages="12"

## Step 4: Verify quote accuracy
Search for "20%" or "improvement" in page 12 content
```

### Example 2: Quality Assessment (GRADE)

```markdown
# Scenario: Assess REF-042 for GRADE baseline

## Phase 1: Quick scan (abstract)
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="1"
Study type: Randomized Controlled Trial
Baseline: HIGH

## Phase 2: Methodology check
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="4-8"
Sample size: 120 participants
Randomization: proper
Blinding: double-blind
Confirmed: HIGH quality

## Phase 3: Results consistency
Read file_path=".aiwg/research/sources/REF-042.pdf" pages="8-10"
Confidence intervals reported
Effect size: significant (p<0.01)
No inconsistencies detected

## Final: HIGH quality (no downgrades)
```

### Example 3: Bulk Corpus Scan

```markdown
# Scenario: Check 50 papers for missing authors in frontmatter

## For each paper:
for ref in REF-001 through REF-050:
  # Read lightweight reference
  @.aiwg/research/sources/{ref}.pdf

  # Read abstract only
  Read file_path=".aiwg/research/sources/{ref}.pdf" pages="1"

  # Extract authors from page 1
  authors = extract_authors(page_1_content)

  # Compare with frontmatter
  frontmatter_authors = load_frontmatter_authors(ref)
  if authors != frontmatter_authors:
    flag_discrepancy(ref)

## Token cost: ~25,000 (vs 750,000 for full reads)
```

### Example 4: Multi-Paper Synthesis

```markdown
# Scenario: Compare TDD effectiveness across 5 studies

## Step 1: Read all abstracts
Read file_path="REF-010.pdf" pages="1"  # TDD study 1
Read file_path="REF-023.pdf" pages="1"  # TDD study 2
Read file_path="REF-035.pdf" pages="1"  # TDD study 3
Read file_path="REF-041.pdf" pages="1"  # TDD study 4
Read file_path="REF-052.pdf" pages="1"  # TDD study 5

## Step 2: Identify key metrics pages
Study 1: results on page 9
Study 2: results on page 12
Study 3: results on page 8
Study 4: results on page 10
Study 5: results on page 7

## Step 3: Read results sections
Read file_path="REF-010.pdf" pages="9"
Read file_path="REF-023.pdf" pages="12"
Read file_path="REF-035.pdf" pages="8"
Read file_path="REF-041.pdf" pages="10"
Read file_path="REF-052.pdf" pages="7"

## Step 4: Synthesize findings
Effect sizes: 15%, 22%, 18%, 30%, 12%
Mean: 19.4%
Range: 12-30%

## Token cost: ~6,000 (vs 75,000 for full papers)
```

## Best Practices

### 1. Always Start with Abstracts

```markdown
# Good: Progressive disclosure
Read file_path="paper.pdf" pages="1"  # Abstract first
if relevant:
  Read file_path="paper.pdf" pages="4-8"  # Then methodology
  if still_needed:
    Read file_path="paper.pdf" pages="all"  # Full paper last

# Bad: Load everything upfront
Read file_path="paper.pdf" pages="all"  # Wastes tokens if not relevant
```

### 2. Use Frontmatter to Guide Page Selection

```yaml
# Store key page numbers in frontmatter
key_findings:
  - finding: "34% improvement"
    metric: "+34% accuracy"
    page: 7  # ← Use this for targeted reads
    impact: high

  - finding: "Sample size n=500"
    page: 5  # ← Methodology section
    impact: high
```

### 3. Batch Related Page Reads

```markdown
# Good: Single read for contiguous sections
Read file_path="paper.pdf" pages="4-8"  # Methods + results together

# Less efficient: Multiple reads
Read file_path="paper.pdf" pages="4"
Read file_path="paper.pdf" pages="5"
Read file_path="paper.pdf" pages="6"
Read file_path="paper.pdf" pages="7"
Read file_path="paper.pdf" pages="8"
```

### 4. Document Page Mappings in Corpus

```yaml
# Add to research corpus metadata
paper_structure:
  abstract: 1
  introduction: 2-3
  methodology: 4-6
  results: 7-9
  discussion: 9-11
  conclusion: 11-12
  references: 12-14

# Enables automation:
# Read file_path="paper.pdf" pages="{paper_structure.methodology}"
```

## Cross-Platform Compatibility

### Platform Support Matrix

| Platform | PDF Page Ranges | Fallback Behavior |
|----------|-----------------|-------------------|
| Claude Code v2.1.30+ | Supported | N/A |
| Claude Code <v2.1.30 | Not supported | Full PDF read |
| GitHub Copilot | Not supported | Full PDF read |
| Cursor | Not supported | Full PDF read |
| Factory AI | Not supported | Full PDF read |

### Graceful Degradation

Research tools MUST handle both modes:

```typescript
// Hypothetical binding for the host Read tool; the real call depends on the platform.
declare function readTool(args: { file_path: string; pages?: string }): string;

// Read the requested pages, falling back to a full read when page ranges are unsupported.
function readPdfPages(path: string, pages?: string): string {
  try {
    // Attempt page range read
    return readTool({ file_path: path, pages });
  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`.
    // The matched message text is illustrative; the actual error depends on the platform.
    if (error instanceof Error && error.message.includes('pages parameter not supported')) {
      // Fall back to full read
      console.warn('Page ranges not supported, reading full PDF');
      return readTool({ file_path: path });
    }
    throw error;
  }
}
```

### Detection Method

```markdown
# Test for page range support
try:
  Read file_path="test.pdf" pages="1"
  PAGE_RANGES_SUPPORTED = true
except:
  PAGE_RANGES_SUPPORTED = false
```

## Integration with Research Metadata Rules

### Enhanced Frontmatter

```yaml
---
ref_id: "REF-015"
title: "Self-Refine: Iterative Refinement with Self-Feedback"
pdf_hash: "a1b2c3d4..."

# NEW: Page-level metadata
page_structure:
  abstract: 1
  methodology: 4-7
  results: 7-10
  key_tables:
    - page: 8
      caption: "Performance comparison"
    - page: 9
      caption: "Ablation study"

# NEW: Citation page references
key_findings:
  - finding: "20% improvement over baseline"
    metric: "+20% accuracy"
    page: 9  # ← Enables targeted verification
    impact: high

  - finding: "94% of failures due to bad feedback"
    metric: "94% attribution"
    page: 12  # ← Direct citation verification
    impact: critical
---
```

### Validation Updates

```yaml
validation_checklist:
  # Existing checks
  - pdf_hash_recorded
  - doi_verified
  - frontmatter_complete

  # NEW: Page-level checks
  - page_structure_documented
  - key_findings_have_pages
  - citation_pages_verified
```

## Performance Comparison

### Token Consumption Analysis

| Operation | Before (Full Read) | After (Page Ranges) | Savings |
|-----------|-------------------|---------------------|---------|
| Single paper citation check | 15,000 tokens | 500 tokens | 97% |
| GRADE assessment (5 papers) | 75,000 tokens | 10,000 tokens | 87% |
| Corpus health scan (50 papers) | 750,000 tokens | 25,000 tokens | 97% |
| Synthesis (10 papers) | 150,000 tokens | 20,000 tokens | 87% |

### Real-World Example

**Scenario**: Verify 20 citations across 15 papers

**Before**:
```
Load 15 full PDFs: 15 × 10,000 = 150,000 tokens
Often exceeds context window
Requires multiple sessions
```

**After**:
```
Load 20 specific pages: 20 × 500 = 10,000 tokens
Single session
93% token reduction
```

## Migration Checklist

When updating research tools to use page ranges:

- [ ] Test for page range support before using
- [ ] Implement graceful fallback to full reads
- [ ] Update frontmatter schema to include page structure
- [ ] Document page mappings for key papers
- [ ] Add page numbers to key_findings
- [ ] Update citation verification to use targeted reads
- [ ] Refactor corpus health checks to use abstracts only
- [ ] Add page range examples to tool documentation
- [ ] Test cross-platform compatibility
- [ ] Update token budget calculations

## References

- @.claude/rules/research-metadata.md - Research metadata requirements
- @docs/cli-reference.md - AIWG CLI commands
- @.aiwg/research/docs/grade-assessment-guide.md - GRADE methodology
- Claude Code v2.1.30 Release Notes - PDF page range feature

---

**Status**: ACTIVE
**Version**: 1.0.0
**Issue**: #290
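
## Appendix: Validating Page Specs

The formats in the Page Range Format table can be checked before a Read call is issued, so that a citation pointing past the end of a paper is caught early. The sketch below is a minimal illustration and is not part of AIWG or Claude Code: `resolvePages` and `range` are hypothetical names, and the page count is assumed to be known already (for example from frontmatter or the lightweight reference). It also accepts comma-separated mixes such as `"1,3-5"`, which this guide uses in its quality-assessment pattern, and it does not handle `"all"`.

```typescript
// Hypothetical helper (an assumption for illustration, not an AIWG or Claude Code API).
// Resolves a page spec to sorted, de-duplicated, 1-based page numbers.
function resolvePages(spec: string, pageCount: number): number[] {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new RangeError(`Invalid page count: ${pageCount}`);
  }
  const trimmed = spec.trim();

  // "-N": the last N pages, clamped to the start of the document.
  const lastN = /^-(\d+)$/.exec(trimmed);
  if (lastN) {
    const n = Number(lastN[1]);
    if (n < 1) throw new RangeError(`Invalid page spec: "${spec}"`);
    return range(Math.max(1, pageCount - n + 1), pageCount);
  }

  // Comma-separated parts, each either "N" or "N-M".
  const pages = new Set<number>();
  for (const part of trimmed.split(",")) {
    const m = /^(\d+)(?:-(\d+))?$/.exec(part.trim());
    if (!m) throw new SyntaxError(`Invalid page spec: "${spec}"`);
    const first = Number(m[1]);
    const last = m[2] !== undefined ? Number(m[2]) : first;
    if (first < 1 || last < first || last > pageCount) {
      throw new RangeError(`Pages out of bounds in "${spec}" (${pageCount} pages)`);
    }
    for (const p of range(first, last)) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

// Inclusive integer range; returns [] when start > end.
function range(start: number, end: number): number[] {
  const out: number[] = [];
  for (let p = start; p <= end; p++) out.push(p);
  return out;
}
```

For a 15-page paper, `resolvePages("-2", 15)` resolves to pages 14 and 15, and `resolvePages("1,10-12", 15)` resolves to pages 1, 10, 11, and 12. A spec such as `"16"` throws a `RangeError`, which a citation check could report as a bad page reference instead of attempting the read.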