# PDF Vector TypeScript/JavaScript SDK

The official TypeScript/JavaScript SDK for the PDF Vector API: Convert PDF and Word documents to clean, structured markdown format with optional AI enhancement, search across multiple academic databases with a unified API, and fetch specific publications by DOI, PubMed ID, ArXiv ID, and more.

## Installation

```bash
npm install pdfvector
# or
yarn add pdfvector
# or
pnpm add pdfvector
# or
bun add pdfvector
```

## Quick Start

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

// Parse from document URL or data
const parseResult = await client.parse({
  url: "https://example.com/document.pdf",
  useLLM: "auto",
});

console.log(parseResult.markdown); // Return clean markdown
console.log(
  `Pages: ${parseResult.pageCount}, Credits: ${parseResult.creditCount}`,
);
```

## Authentication

Get your API key from the [PDF Vector dashboard](https://www.pdfvector.com/api-keys). The SDK requires a valid API key for all operations.

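
Since every operation needs a key, it may be convenient to keep the key out of source code and read it from the environment. The following is a minimal sketch rather than part of the SDK: `requireApiKey` is a hypothetical helper, and `PDFVECTOR_API_KEY` is an assumed variable name that the SDK does not read on its own.

```typescript
// Hypothetical helper (not part of the pdfvector package).
// `PDFVECTOR_API_KEY` is an assumed variable name chosen for illustration.
function requireApiKey(
  env: Record<string, string | undefined>,
  name = "PDFVECTOR_API_KEY",
): string {
  // Treat a missing, empty, or whitespace-only value as "not configured".
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing API key: set the ${name} environment variable`);
  }
  return value;
}

// Usage in Node.js: const apiKey = requireApiKey(process.env);
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first API call. The returned string can be passed as `apiKey` when constructing the client.
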
```typescript
const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });
```

## Usage Examples

### Parse from URL

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  url: "https://arxiv.org/pdf/2301.00001.pdf",
  useLLM: "auto",
});

console.log(result.markdown);
```

### Parse from data

```typescript
import { readFile } from "fs/promises";
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  data: await readFile("document.pdf"),
  contentType: "application/pdf",
  useLLM: "auto",
});

console.log(result.markdown);
```

### Search academic publications

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const searchResponse = await client.academicSearch({
  query: "quantum computing",
  providers: ["semantic-scholar", "arxiv", "pubmed"], // Search across multiple academic databases
  limit: 20,
  yearFrom: 2021,
  yearTo: 2024,
});

searchResponse.results.forEach((publication) => {
  console.log(`Title: ${publication.title}`);
  console.log(`Authors: ${publication.authors?.map((a) => a.name).join(", ")}`);
  console.log(`Year: ${publication.year}`);
  console.log(`Abstract: ${publication.abstract}`);
  console.log("---");
});
```

### Search with Provider-Specific Data

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const searchResponse = await client.academicSearch({
  query: "CRISPR gene editing",
  providers: ["semantic-scholar"],
  fields: ["title", "authors", "year", "providerData"], // providerData carries provider-specific metadata
});

searchResponse.results.forEach((pub) => {
  if (pub.provider === "semantic-scholar" && pub.providerData) {
    const data = pub.providerData;
    console.log(`Influential Citations: ${data.influentialCitationCount}`);
    console.log(`Fields of Study: ${data.fieldsOfStudy?.join(", ")}`);
  }
});
```

### Fetch Academic Publications by ID

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const response = await client.academicFetch({
  ids: [
    "10.1038/nature12373", // DOI
    "12345678", // PubMed ID
    "2301.00001", // ArXiv ID
    "arXiv:2507.16298v1", // ArXiv with prefix
    "ED123456", // ERIC ID
    "0f40b1f08821e22e859c6050916cec3667778613", // Semantic Scholar ID
  ],
  fields: ["title", "authors", "year", "abstract", "doi"], // Optional: specify fields
});

// Handle successful results
response.results.forEach((pub) => {
  console.log(`Title: ${pub.title}`);
  console.log(`Provider: ${pub.detectedProvider}`);
  console.log(`Requested as: ${pub.id}`);
});

// Handle errors for IDs that couldn't be fetched
response.errors?.forEach((error) => {
  console.log(`Failed to fetch ${error.id}: ${error.error}`);
});
```

### Error Handling

```typescript
import { PDFVector, PDFVectorError } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

try {
  const result = await client.parse({
    url: "https://example.com/document.pdf",
  });
  console.log(result.markdown);
} catch (error) {
  if (error instanceof PDFVectorError) {
    // Errors raised by the API carry a status and an error code
    console.error(`API Error: ${error.message}`);
    console.error(`Status: ${error.status}`);
    console.error(`Code: ${error.code}`);
  } else {
    console.error("Unexpected Error:", error);
  }
}
```

## API Reference

`PDFVector` is the client class for interacting with the PDF Vector API.

### Constructor

```typescript
new PDFVector(config: PDFVectorConfig)
```

**Parameters:**

- `config.apiKey` (string): Your PDF Vector API key
- `config.baseUrl` (string, optional): Custom base URL (defaults to `https://www.pdfvector.com`)

### Methods

#### `parse(request)`

Parse a PDF or Word document and convert it to markdown.

**Parameters:**

For URL parsing:

```typescript
{
  url: string; // Direct URL to PDF/Word document
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```

For data parsing:

```typescript
{
  data: string | Buffer | Uint8Array | ArrayBuffer | Blob | ReadableStream; // Direct data of PDF/Word document
  contentType: string; // MIME type (e.g., 'application/pdf')
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```

**Returns:**

```typescript
{
  markdown: string; // Extracted content as markdown
  pageCount: number; // Number of pages processed
  creditCount: number; // Credits consumed (1-2 per page)
  usedLLM: boolean; // Whether AI enhancement was used
}
```

#### LLM Usage Options

- **`auto`** (default): Automatically decide if AI enhancement is needed (1-2 credits per page)
- **`never`**: Standard parsing without AI (1 credit per page)
- **`always`**: Force AI enhancement (2 credits per page)

**Note:** Free plans are limited to `useLLM: 'never'`. Upgrade to a paid plan for AI enhancement.

#### Supported File Types

##### PDF Documents

- `application/pdf`
- `application/x-pdf`
- `application/acrobat`
- `application/vnd.pdf`
- `text/pdf`
- `text/x-pdf`

##### Word Documents

- `application/msword` (.doc)
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (.docx)

#### Usage Limits

- **Processing timeout**: 3 minutes per document
- **File size**: No explicit limit, but larger files usually have more pages and consume more credits

#### Cost

- **Credits**: Consumed per page (1-2 credits depending on LLM usage)

#### Common error codes:

- `url-not-found`: Document URL not accessible
- `unsupported-content-type`: File type not supported
- `timeout-error`: Processing timeout (3 minutes max)
- `payment-required`: Usage limit reached

### `academicSearch(request)`

Search academic publications across multiple databases.

**Parameters:**

```typescript
{
  query: string; // Search query
  providers?: AcademicSearchProvider[]; // Databases to search (default: ["semantic-scholar"])
  offset?: number; // Pagination offset (default: 0)
  limit?: number; // Results per page, 1-100 (default: 20)
  yearFrom?: number; // Filter by publication year (from) (min: 1900)
  yearTo?: number; // Filter by publication year (to) (max: 2050)
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```

**Supported Providers:**

- `"semantic-scholar"` - [Semantic Scholar](https://www.semanticscholar.org/)
- `"arxiv"` - [ArXiv](https://arxiv.org/)
- `"pubmed"` - [PubMed](https://pubmed.ncbi.nlm.nih.gov/)
- `"google-scholar"` - [Google Scholar](https://scholar.google.com/)
- `"eric"` - [ERIC](https://eric.ed.gov/)

**Available Fields:**

- Basic fields: `"id"`, `"doi"`, `"title"`, `"url"`, `"providerURL"`, `"authors"`, `"date"`, `"year"`, `"totalCitations"`, `"totalReferences"`, `"abstract"`, `"pdfURL"`, `"provider"`
- Extended field: `"providerData"` - Provider-specific metadata

**Returns:**

```typescript
{
  estimatedTotalResults: number; // Total results available
  results: AcademicSearchPublication[]; // Array of publications
  errors?: AcademicSearchProviderError[]; // Any provider errors
}
```

#### Cost

- **Credits**: 2 credits per search.

### `academicFetch(request)` / `fetch(request)`

Fetch specific academic publications by their IDs with automatic provider detection.

**Parameters:**

```typescript
{
  ids: string[]; // Array of publication IDs to fetch
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```

**Supported ID Types:**

- **DOI**: e.g., `"10.1038/nature12373"`
- **PubMed ID**: e.g., `"12345678"` (numeric ID)
- **ArXiv ID**: e.g., `"2301.00001"` or `"arXiv:2301.00001"` or `"math.GT/0309136"`
- **Semantic Scholar ID**: e.g., `"0f40b1f08821e22e859c6050916cec3667778613"`
- **ERIC ID**: e.g., `"ED123456"`

**Returns:**

```typescript
{
  results: AcademicFetchResult[]; // Successfully fetched publications
  errors?: AcademicFetchError[]; // Errors for IDs that couldn't be fetched
}
```

Each result includes:

```typescript
{
  id: string; // The ID that was used to fetch
  detectedProvider: string; // Provider that was used
  // ... all publication fields (title, authors, abstract, etc.)
}
```

#### Cost

- **Credits**: 2 credits per fetch.

## TypeScript Support

The SDK is written in TypeScript and includes full type definitions:

```typescript
import type {
  // Core classes
  PDFVector,
  PDFVectorConfig,
  PDFVectorError,
  // Parse API types
  ParseURLRequest,
  ParseDataRequest,
  ParseResponse,
  // Academic Search API types
  SearchRequest,
  AcademicSearchResponse,
  AcademicSearchPublication,
  AcademicSearchProvider,
  AcademicSearchAuthor,
  AcademicSearchPublicationField,
  // Academic Fetch API types
  FetchRequest,
  AcademicFetchResponse,
  AcademicFetchResult,
  AcademicFetchError,
  // Provider-specific data types
  AcademicSearchSemanticScholarData,
  AcademicSearchGoogleScholarData,
  AcademicSearchPubMedData,
  AcademicSearchArxivData,
  AcademicSearchEricData,
} from "pdfvector";

// Constants
import {
  AcademicSearchProviderValues, // Array of valid providers
  AcademicSearchPublicationFieldValues, // Array of valid fields
} from "pdfvector";
```

## Node.js Support

- **Node.js version**: Node.js 20+
- **ESM**: Supports ES modules (CommonJS is not supported)
- **Dependencies**: Uses standard `fetch` API

## Examples

### Batch Processing

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const documents = [
  "https://example.com/doc1.pdf",
  "https://example.com/doc2.pdf",
];

// Parse all documents concurrently; Promise.all rejects if any single parse fails
const results = await Promise.all(
  documents.map((url) => client.parse({ url, useLLM: "auto" })),
);

results.forEach((result, index) => {
  console.log(`Document ${index + 1}:`);
  console.log(`Pages: ${result.pageCount}`);
  console.log(`Credits: ${result.creditCount}`);
});
```

### Academic Search with Pagination

```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

let offset = 0;
const limit = 50;
const allResults = [];

// Fetch first page
let response = await client.academicSearch({
  query: "climate change",
  providers: ["semantic-scholar", "arxiv"],
  offset,
  limit,
});
allResults.push(...response.results);

// Fetch more pages as needed
while (
  response.results.length > 0 && // Stop on an empty page; the total is only an estimate
  allResults.length < response.estimatedTotalResults &&
  allResults.length < 200
) {
  offset += limit;
  response = await client.academicSearch({
    query: "climate change",
    providers: ["semantic-scholar", "arxiv"],
    offset,
    limit,
  });
  allResults.push(...response.results);
}

console.log(`Fetched ${allResults.length} publications`);
```

### Custom Base URL

```typescript
import { PDFVector } from "pdfvector";

// For development or custom deployments
const client = new PDFVector({
  apiKey: "pdfvector_api_key_here",
  baseUrl: "https://pdfvector.acme.com",
});
```

## Support

- **API Reference (Scalar)**: [pdfvector.com/v1/api/scalar](https://www.pdfvector.com/v1/api/scalar)
- **API Reference (Swagger)**: [pdfvector.com/v1/api/swagger](https://www.pdfvector.com/v1/api/swagger)
- **Dashboard**: [pdfvector.com/dashboard](https://www.pdfvector.com/dashboard)

## License

This SDK is licensed under the MIT License.