# PDF Vector TypeScript/JavaScript SDK
The official TypeScript/JavaScript SDK for the PDF Vector API. With it you can:
- Convert PDF and Word documents to clean, structured markdown, with optional AI enhancement
- Ask questions about documents using AI
- Extract structured data from documents with JSON Schema
- Search multiple academic databases through a unified API
- Fetch specific publications by DOI, PubMed ID, ArXiv ID, and more
## Installation
```bash
npm install pdfvector
# or
yarn add pdfvector
# or
pnpm add pdfvector
# or
bun add pdfvector
```
## Quick Start
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

// Parse from document URL or data
const parseResult = await client.parse({
  url: "https://example.com/document.pdf",
  useLLM: "auto",
});
console.log(parseResult.markdown); // Clean markdown output
console.log(
  `Pages: ${parseResult.pageCount}, Credits: ${parseResult.creditCount}`,
);

// Ask questions about documents
const askResult = await client.ask({
  url: "https://example.com/research-paper.pdf",
  prompt: "What are the key findings and conclusions?",
});
console.log(askResult.markdown); // AI-generated answer in markdown format
console.log(`Pages: ${askResult.pageCount}, Credits: ${askResult.creditCount}`);

// Extract structured data using JSON Schema
const extractResult = await client.extract({
  url: "https://example.com/research-paper.pdf",
  prompt: "Extract the research information",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      authors: { type: "array", items: { type: "string" } },
      abstract: { type: "string" },
      findings: { type: "array", items: { type: "string" } },
    },
    required: ["title", "abstract"],
    additionalProperties: false,
  },
});
console.log(extractResult.data); // Structured JSON output matching the schema
```
## Authentication
Get your API key from the [PDF Vector dashboard](https://www.pdfvector.com/api-keys). The SDK requires a valid API key for all operations.
```typescript
const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });
```
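Hard-coding the key as shown is fine for a quick test, but you will probably want to load it from the environment instead. The following is a minimal sketch, assuming a hypothetical environment variable named `PDFVECTOR_API_KEY`; the SDK does not define or read this name itself, and `readApiKey` is an illustrative helper, not part of the package.

```typescript
// Hypothetical helper: the variable name PDFVECTOR_API_KEY is an assumption,
// not something the SDK defines or reads on its own.
type Env = Record<string, string | undefined>;

function readApiKey(env: Env, name = "PDFVECTOR_API_KEY"): string {
  // Treat a missing or whitespace-only value as "not configured"
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing API key: set the ${name} environment variable`);
  }
  return value;
}

// Usage (Node.js):
//   const client = new PDFVector({ apiKey: readApiKey(process.env) });
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first API call.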
## Usage Examples
### Parse from URL
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  url: "https://arxiv.org/pdf/2301.00001.pdf",
  useLLM: "auto",
});
console.log(result.markdown);
```
### Parse from data
```typescript
import { readFile } from "fs/promises";
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  data: await readFile("document.pdf"),
  contentType: "application/pdf",
  useLLM: "auto",
});
console.log(result.markdown);
```
### Ask questions about documents
Ask questions about PDF and Word documents using AI and get natural language answers.
#### Ask from URL
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.ask({
  url: "https://arxiv.org/pdf/2301.00001.pdf",
  prompt: "What methodology was used in this research?",
});
console.log(result.markdown); // AI-generated answer in markdown format
console.log(`Document has ${result.pageCount} pages`);
console.log(`Cost: ${result.creditCount} credits`);
```
#### Ask from file data
```typescript
import { readFile } from "fs/promises";
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.ask({
  data: await readFile("research-paper.pdf"),
  contentType: "application/pdf",
  prompt: "Summarize the main findings and their implications",
});
console.log(result.markdown);
```
### Extract structured data from documents
Extract structured data from PDF and Word documents using AI and JSON Schema.
#### Extract from URL
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.extract({
  url: "https://example.com/invoice.pdf",
  prompt: "Extract invoice details",
  schema: {
    type: "object",
    properties: {
      invoiceNumber: { type: "string" },
      date: { type: "string" },
      totalAmount: { type: "number" },
      items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            description: { type: "string" },
            quantity: { type: "number" },
            price: { type: "number" },
          },
          additionalProperties: false,
        },
      },
    },
    required: ["invoiceNumber", "date", "totalAmount", "items"],
    additionalProperties: false,
  },
});
console.log(result.data); // Structured data matching the schema
console.log(`Cost: ${result.creditCount} credits`);
```
#### Extract from file data
```typescript
import { readFile } from "fs/promises";
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.extract({
  data: await readFile("research-paper.pdf"),
  contentType: "application/pdf",
  prompt: "Extract research paper metadata",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      authors: { type: "array", items: { type: "string" } },
      abstract: { type: "string" },
      keywords: { type: "array", items: { type: "string" } },
      publicationDate: { type: "string" },
    },
    required: ["title", "authors", "abstract"],
    additionalProperties: false,
  },
});
console.log(result.data);
```
### Search academic publications
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

// Search across multiple academic databases
const searchResponse = await client.academicSearch({
  query: "quantum computing",
  providers: ["semantic-scholar", "arxiv", "pubmed"],
  limit: 20,
  yearFrom: 2021,
  yearTo: 2024,
});

searchResponse.results.forEach((publication) => {
  console.log(`Title: ${publication.title}`);
  console.log(`Authors: ${publication.authors?.map((a) => a.name).join(", ")}`);
  console.log(`Year: ${publication.year}`);
  console.log(`Abstract: ${publication.abstract}`);
  console.log("---");
});
```
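When several providers are queried at once, the same paper may come back more than once. The sketch below merges results by DOI. It relies only on the `doi`, `title`, and `provider` fields listed under Available Fields; the `Publication` interface is a simplified stand-in for illustration, not the SDK's `AcademicSearchPublication` type, and `dedupeByDoi` is a hypothetical helper.

```typescript
// Simplified stand-in for the SDK's publication type (assumption).
interface Publication {
  doi?: string | null;
  title?: string | null;
  provider?: string;
}

// Keeps the first occurrence of each DOI (compared case-insensitively).
// Entries without a DOI are always kept because they cannot be matched safely.
function dedupeByDoi<T extends Publication>(publications: readonly T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const publication of publications) {
    const key = publication.doi?.trim().toLowerCase();
    if (key) {
      if (seen.has(key)) continue; // duplicate DOI, skip
      seen.add(key);
    }
    unique.push(publication);
  }
  return unique;
}

// Usage:
//   const unique = dedupeByDoi(searchResponse.results);
```

Because the first occurrence wins, the order of `providers` in the request effectively decides which provider's record is kept, assuming results arrive in that order.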
### Search with Provider-Specific Data
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const searchResponse = await client.academicSearch({
  query: "CRISPR gene editing",
  providers: ["semantic-scholar"],
  // "providerData" requests the provider-specific metadata
  fields: ["title", "authors", "year", "providerData"],
});

searchResponse.results.forEach((pub) => {
  if (pub.provider === "semantic-scholar" && pub.providerData) {
    const data = pub.providerData;
    console.log(`Influential Citations: ${data.influentialCitationCount}`);
    console.log(`Fields of Study: ${data.fieldsOfStudy?.join(", ")}`);
  }
});
```
### Fetch Academic Publications by ID
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const response = await client.academicFetch({
  ids: [
    "10.1038/nature12373", // DOI
    "12345678", // PubMed ID
    "2301.00001", // ArXiv ID
    "arXiv:2507.16298v1", // ArXiv with prefix
    "ED123456", // ERIC ID
    "0f40b1f08821e22e859c6050916cec3667778613", // Semantic Scholar ID
  ],
  fields: ["title", "authors", "year", "abstract", "doi"], // Optional: specify fields
});

// Handle successful results
response.results.forEach((pub) => {
  console.log(`Title: ${pub.title}`);
  console.log(`Provider: ${pub.detectedProvider}`);
  console.log(`Requested as: ${pub.id}`);
});

// Handle errors for IDs that couldn't be fetched
response.errors?.forEach((error) => {
  console.log(`Failed to fetch ${error.id}: ${error.error}`);
});
```
### Error Handling
```typescript
import { PDFVector, PDFVectorError } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

try {
  const result = await client.parse({
    url: "https://example.com/document.pdf",
  });
  console.log(result.markdown);
} catch (error) {
  if (error instanceof PDFVectorError) {
    console.error(`API Error: ${error.message}`);
    console.error(`Status: ${error.status}`);
    console.error(`Code: ${error.code}`);
  } else {
    console.error("Unexpected Error:", error);
  }
}
```
## API Reference
`PDFVector` is the client class for interacting with the PDF Vector API.
### Constructor
```typescript
new PDFVector(config: PDFVectorConfig)
```
**Parameters:**
- `config.apiKey` (string): Your PDF Vector API key
- `config.baseUrl` (string, optional): Custom base URL (defaults to `https://www.pdfvector.com`)
### Methods
#### `parse(request)`
Parse a PDF or Word document and convert it to markdown.
**Parameters:**
For URL parsing:
```typescript
{
  url: string; // Direct URL to PDF/Word document
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```
For data parsing:
```typescript
{
  data: string | Buffer | Uint8Array | ArrayBuffer | Blob | ReadableStream; // Direct data of PDF/Word document
  contentType: string; // MIME type (e.g., 'application/pdf')
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```
**Returns:**
```typescript
{
  markdown: string; // Extracted content as markdown
  pageCount: number; // Number of pages processed
  creditCount: number; // Credits consumed (1-2 per page)
  usedLLM: boolean; // Whether AI enhancement was used
}
```
#### LLM Usage Options
- **`auto`** (default): Automatically decide if AI enhancement is needed (1-2 credits per page)
- **`never`**: Standard parsing without AI (1 credit per page)
- **`always`**: Force AI enhancement (2 credits per page)
**Note:** Free plans are limited to `useLLM: 'never'`. Upgrade to a paid plan for AI enhancement.
#### Supported File Types
##### PDF Documents
- `application/pdf`
- `application/x-pdf`
- `application/acrobat`
- `application/vnd.pdf`
- `text/pdf`
- `text/x-pdf`
##### Word Documents
- `application/msword` (.doc)
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (.docx)
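When passing `data`, you have to supply `contentType` yourself. A small lookup from file extension to MIME type can help; the sketch below uses one canonical type per extension from the lists above. `contentTypeFor` is a hypothetical helper, not part of the SDK, and it only looks at the file name rather than the file contents.

```typescript
// MIME types taken from the lists above; one canonical type per extension.
const CONTENT_TYPES = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["doc", "application/msword"],
  [
    "docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ],
]);

// Returns the MIME type for a file name, or undefined if the extension
// is not one of the supported document types.
function contentTypeFor(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined; // no extension at all
  const extension = fileName.slice(dot + 1).toLowerCase();
  return CONTENT_TYPES.get(extension);
}

// Usage:
//   const contentType = contentTypeFor("report.docx");
//   if (!contentType) throw new Error("Unsupported file type");
```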
#### Usage Limits
- **Processing timeout**: 3 minutes per document
- **File size**: No explicit limit, but larger files usually have more pages and consume more credits
#### Cost
- **Credits**: Consumed per page (1-2 credits depending on LLM usage)
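To budget before parsing, the per-page rates listed under LLM Usage Options can be turned into a lower and upper bound. This is a rough sketch: `estimateParseCredits` is a hypothetical helper, and it assumes you already know the page count, which the API only reports after processing.

```typescript
type UseLLM = "auto" | "always" | "never";

interface CreditRange {
  min: number;
  max: number;
}

// Per-page rates from the LLM Usage Options list:
// never = 1, always = 2, auto = 1 or 2 depending on what the API decides.
function estimateParseCredits(
  pageCount: number,
  useLLM: UseLLM = "auto",
): CreditRange {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  const perPage = {
    never: { min: 1, max: 1 },
    always: { min: 2, max: 2 },
    auto: { min: 1, max: 2 },
  }[useLLM];
  return { min: pageCount * perPage.min, max: pageCount * perPage.max };
}

// Usage:
//   const { min, max } = estimateParseCredits(10, "auto");
```

The actual charge is always the `creditCount` returned by `parse`; the range is only a planning aid.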
#### Common Error Codes
- `url-not-found`: Document URL not accessible
- `unsupported-content-type`: File type not supported
- `timeout-error`: Processing timeout (3 minutes max)
- `payment-required`: Usage limit reached
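Of these codes, `timeout-error` is the one that might plausibly succeed on a second attempt, although that is an assumption rather than documented behavior. The sketch below retries an operation with exponential backoff when the thrown error carries a matching `code`, as `PDFVectorError` does in the Error Handling example. `withRetry` and its options are hypothetical and not part of the SDK.

```typescript
interface RetryOptions {
  retries?: number; // extra attempts after the first one
  delayMs?: number; // base delay, doubled after each failed attempt
  retryOn?: string[]; // error codes treated as transient
}

// Reads a string `code` property from an unknown thrown value, if present.
function errorCode(error: unknown): string | undefined {
  if (typeof error === "object" && error !== null && "code" in error) {
    const code = (error as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

async function withRetry<T>(
  operation: () => Promise<T>,
  { retries = 2, delayMs = 1000, retryOn = ["timeout-error"] }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const code = errorCode(error);
      const retryable = code !== undefined && retryOn.includes(code);
      // Give up on non-transient errors or when attempts are exhausted
      if (!retryable || attempt >= retries) throw error;
      const wait = delayMs * 2 ** attempt;
      await new Promise<void>((resolve) => setTimeout(resolve, wait));
    }
  }
}

// Usage:
//   const result = await withRetry(() => client.parse({ url }));
```

Keep in mind that a retried request may consume credits again; the source does not say whether timed-out requests are billed.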
#### `ask(request)`
Ask questions about PDF or Word documents and get natural language answers.
**Parameters:**
For URL input:
```typescript
{
  url: string; // Direct URL to PDF/Word document
  prompt: string; // The question you want to ask about the document
}
```
For data input:
```typescript
{
  data: string | Buffer | Uint8Array | ArrayBuffer | Blob | ReadableStream; // Document data
  contentType: string; // MIME type (e.g., 'application/pdf')
  prompt: string; // The question you want to ask about the document
}
```
**Returns:**
```typescript
{
  markdown: string; // AI-generated answer in markdown format
  pageCount: number; // Number of pages in the document
  creditCount: number; // Credits consumed (3 per page)
}
```
#### Document Q&A Features
- **Natural language responses**: AI provides answers in clear, readable markdown format
- **Contextual understanding**: AI analyzes the entire document to provide relevant answers
- **Multiple formats**: Supports both PDF and Word documents
- **Page-based pricing**: 3 credits per page in the document
#### Cost
- **Credits**: 3 credits per page in the document
#### Common Error Codes
- `url-not-found`: Document URL not accessible
- `unsupported-content-type`: File type not supported
- `page-count-not-found`: Unable to detect page count
- `timeout-error`: Processing timeout
- `payment-required`: Usage limit reached
#### `extract(request)`
Extract structured data from PDF or Word documents using AI and JSON Schema.
**Parameters:**
For URL input:
```typescript
{
  url: string; // Direct URL to PDF/Word document
  prompt: string; // Instructions for extracting structured data
  schema: object; // JSON Schema defining the structure of expected output
}
```
For data input:
```typescript
{
  data: string | Buffer | Uint8Array | ArrayBuffer | Blob | ReadableStream; // Document data
  contentType: string; // MIME type (e.g., 'application/pdf')
  prompt: string; // Instructions for extracting structured data
  schema: object; // JSON Schema defining the structure of expected output
}
```
**Returns:**
```typescript
{
  data: object; // Structured data matching the provided schema
  pageCount: number; // Number of pages in the document
  creditCount: number; // Credits consumed (3 per page)
}
```
#### JSON Schema Requirements
- Must be a valid JSON Schema following the specification
- Must include `additionalProperties: false` at the object level
- Can define complex nested structures
- Supports all standard JSON Schema features
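A forgotten `additionalProperties: false` is easy to miss in a large schema, so a local check before calling `extract` can save a round trip. The sketch below walks `properties` and `items` and reports every object schema that leaves `additionalProperties` open. It is deliberately stricter than the stated requirement, since the source does not say whether nested objects are checked by the API, and `findOpenObjects` is a hypothetical helper, not part of the SDK.

```typescript
type JsonSchema = { [key: string]: unknown };

function isSchemaObject(value: unknown): value is JsonSchema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the paths of object schemas that do not set additionalProperties: false.
// Only `properties` and `items` are traversed; other keywords ($ref, anyOf, ...)
// are ignored in this sketch.
function findOpenObjects(schema: unknown, path = "$"): string[] {
  if (!isSchemaObject(schema)) return [];
  const problems: string[] = [];
  if (schema["type"] === "object" && schema["additionalProperties"] !== false) {
    problems.push(path);
  }
  const properties = schema["properties"];
  if (isSchemaObject(properties)) {
    for (const [name, child] of Object.entries(properties)) {
      problems.push(...findOpenObjects(child, `${path}.properties.${name}`));
    }
  }
  if (schema["items"] !== undefined) {
    problems.push(...findOpenObjects(schema["items"], `${path}.items`));
  }
  return problems;
}

// Usage:
//   const open = findOpenObjects(schema);
//   if (open.length > 0) throw new Error(`Open objects at: ${open.join(", ")}`);
```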
#### Extract Features
- **Schema validation**: Ensures extracted data matches your exact requirements
- **Complex structures**: Supports nested objects, arrays, and various data types
- **Reliable extraction**: AI follows your schema strictly for consistent results
- **Multiple formats**: Supports both PDF and Word documents
#### Cost
- **Credits**: 3 credits per page in the document
#### Common Error Codes
- `url-not-found`: Document URL not accessible
- `unsupported-content-type`: File type not supported
- `invalid-schema`: JSON Schema is invalid or missing additionalProperties
- `timeout-error`: Processing timeout
- `payment-required`: Usage limit reached
#### `academicSearch(request)`
Search academic publications across multiple databases.
**Parameters:**
```typescript
{
  query: string; // Search query
  providers?: AcademicSearchProvider[]; // Databases to search (default: ["semantic-scholar"])
  offset?: number; // Pagination offset (default: 0)
  limit?: number; // Results per page, 1-100 (default: 20)
  yearFrom?: number; // Filter by publication year (from) (min: 1900)
  yearTo?: number; // Filter by publication year (to) (max: 2050)
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```
**Supported Providers:**
- `"semantic-scholar"` - [Semantic Scholar](https://www.semanticscholar.org/)
- `"arxiv"` - [ArXiv](https://arxiv.org/)
- `"pubmed"` - [PubMed](https://pubmed.ncbi.nlm.nih.gov/)
- `"google-scholar"` - [Google Scholar](https://scholar.google.com/)
- `"eric"` - [ERIC](https://eric.ed.gov/)
**Available Fields:**
- Basic fields: `"id"`, `"doi"`, `"title"`, `"url"`, `"providerURL"`, `"authors"`, `"date"`, `"year"`, `"totalCitations"`, `"totalReferences"`, `"abstract"`, `"pdfURL"`, `"provider"`
- Extended field: `"providerData"` - Provider-specific metadata
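The documented ranges (`limit` 1-100, `yearFrom` from 1900, `yearTo` up to 2050) and the provider list can be checked locally before spending credits on a request. The following is a sketch only: `validateSearchParams` is a hypothetical helper, and the non-empty query and `yearFrom <= yearTo` checks are assumptions that the source does not state. In real code the exported `AcademicSearchProviderValues` constant could replace the local provider list.

```typescript
// Provider names copied from the Supported Providers list above.
const PROVIDERS: readonly string[] = [
  "semantic-scholar",
  "arxiv",
  "pubmed",
  "google-scholar",
  "eric",
];

interface SearchParams {
  query: string;
  providers?: string[];
  limit?: number;
  yearFrom?: number;
  yearTo?: number;
}

// Returns a list of human-readable problems; an empty list means the
// request is within the documented ranges.
function validateSearchParams(params: SearchParams): string[] {
  const problems: string[] = [];
  const { limit, yearFrom, yearTo } = params;
  if (params.query.trim() === "") {
    problems.push("query must not be empty"); // assumption, not documented
  }
  for (const provider of params.providers ?? []) {
    if (!PROVIDERS.includes(provider)) {
      problems.push(`unknown provider: ${provider}`);
    }
  }
  if (limit !== undefined && (!Number.isInteger(limit) || limit < 1 || limit > 100)) {
    problems.push("limit must be an integer between 1 and 100");
  }
  if (yearFrom !== undefined && yearFrom < 1900) {
    problems.push("yearFrom must be 1900 or later");
  }
  if (yearTo !== undefined && yearTo > 2050) {
    problems.push("yearTo must be 2050 or earlier");
  }
  if (yearFrom !== undefined && yearTo !== undefined && yearFrom > yearTo) {
    problems.push("yearFrom must not be greater than yearTo"); // assumption
  }
  return problems;
}
```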
**Returns:**
```typescript
{
  estimatedTotalResults: number; // Total results available
  results: AcademicSearchPublication[]; // Array of publications
  errors?: AcademicSearchProviderError[]; // Any provider errors
}
```
#### Cost
- **Credits**: 2 credits per search
#### `academicFetch(request)` / `fetch(request)`
Fetch specific academic publications by their IDs with automatic provider detection.
**Parameters:**
```typescript
{
  ids: string[]; // Array of publication IDs to fetch
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```
**Supported ID Types:**
- **DOI**: e.g., `"10.1038/nature12373"`
- **PubMed ID**: e.g., `"12345678"` (numeric ID)
- **ArXiv ID**: e.g., `"2301.00001"` or `"arXiv:2301.00001"` or `"math.GT/0309136"`
- **Semantic Scholar ID**: e.g., `"0f40b1f08821e22e859c6050916cec3667778613"`
- **ERIC ID**: e.g., `"ED123456"`
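The API detects the provider for each ID on its own, but a rough local guess can catch typos before a request is sent. The patterns below are assumptions inferred from the example IDs in this list, not the detection rules the API actually uses, and `guessIdKind` is a hypothetical helper.

```typescript
type IdKind = "doi" | "arxiv" | "eric" | "semantic-scholar" | "pubmed" | "unknown";

// Patterns inferred from the example IDs above (assumptions, checked in order).
const ID_PATTERNS: ReadonlyArray<[Exclude<IdKind, "unknown">, RegExp]> = [
  ["doi", /^10\.\d{4,9}\/\S+$/],
  ["arxiv", /^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$/i], // e.g. 2301.00001, arXiv:2507.16298v1
  ["arxiv", /^(arxiv:)?[a-z-]+(\.[a-z]{2})?\/\d{7}(v\d+)?$/i], // e.g. math.GT/0309136
  ["eric", /^ED\d+$/i],
  ["semantic-scholar", /^[0-9a-f]{40}$/i], // 40 hex characters
  ["pubmed", /^\d{1,8}$/],
];

function guessIdKind(id: string): IdKind {
  const trimmed = id.trim();
  for (const [kind, pattern] of ID_PATTERNS) {
    if (pattern.test(trimmed)) return kind;
  }
  return "unknown";
}

// Usage:
//   const suspicious = ids.filter((id) => guessIdKind(id) === "unknown");
```

An `"unknown"` result does not mean the API will reject the ID; it only means none of these local patterns matched.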
**Returns:**
```typescript
{
  results: AcademicFetchResult[]; // Successfully fetched publications
  errors?: AcademicFetchError[]; // Errors for IDs that couldn't be fetched
}
```
Each result includes:
```typescript
{
  id: string; // The ID that was used to fetch
  detectedProvider: string; // Provider that was used
  // ... all publication fields (title, authors, abstract, etc.)
}
```
#### Cost
- **Credits**: 2 credits per fetch
## TypeScript Support
The SDK is written in TypeScript and includes full type definitions:
```typescript
import type {
  // Core classes
  PDFVector,
  PDFVectorConfig,
  PDFVectorError,
  // Parse API types
  ParseURLRequest,
  ParseDataRequest,
  ParseResponse,
  // Ask API types
  AskURLRequest,
  AskDataRequest,
  AskResponse,
  // Extract API types
  ExtractURLRequest,
  ExtractDataRequest,
  ExtractResponse,
  // Academic Search API types
  SearchRequest,
  AcademicSearchResponse,
  AcademicSearchPublication,
  AcademicSearchProvider,
  AcademicSearchAuthor,
  AcademicSearchPublicationField,
  // Academic Fetch API types
  FetchRequest,
  AcademicFetchResponse,
  AcademicFetchResult,
  AcademicFetchError,
  // Provider-specific data types
  AcademicSearchSemanticScholarData,
  AcademicSearchGoogleScholarData,
  AcademicSearchPubMedData,
  AcademicSearchArxivData,
  AcademicSearchEricData,
} from "pdfvector";

// Constants
import {
  AcademicSearchProviderValues, // Array of valid providers
  AcademicSearchPublicationFieldValues, // Array of valid fields
} from "pdfvector";
```
## Node.js Support
- **Node.js version**: Node.js 20+
- **ESM**: Supports ES modules (CommonJS is not supported)
- **Dependencies**: Uses standard `fetch` API
## Examples
### Batch Processing
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const documents = [
  "https://example.com/doc1.pdf",
  "https://example.com/doc2.pdf",
];

// Parse all documents in parallel; Promise.all rejects if any single parse fails
const results = await Promise.all(
  documents.map((url) => client.parse({ url, useLLM: "auto" })),
);

results.forEach((result, index) => {
  console.log(`Document ${index + 1}:`);
  console.log(`Pages: ${result.pageCount}`);
  console.log(`Credits: ${result.creditCount}`);
});
```
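`Promise.all` starts every request at once and rejects as soon as one document fails. For larger batches, a concurrency limit and per-item error capture may be preferable. The sketch below is one way to do that; `mapWithLimit` is a hypothetical helper, not part of the SDK, and the source does not document any rate limits, so the right concurrency value is for you to choose.

```typescript
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Runs `task` over `items` with at most `concurrency` tasks in flight.
// Results keep the input order; a failure is recorded instead of aborting the batch.
async function mapWithLimit<I, O>(
  items: readonly I[],
  concurrency: number,
  task: (item: I, index: number) => Promise<O>,
): Promise<Settled<O>[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }
  const results: Settled<O>[] = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        const value = await task(items[index] as I, index);
        results[index] = { status: "fulfilled", value };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
    }
  }

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage:
//   const settled = await mapWithLimit(documents, 2, (url) =>
//     client.parse({ url, useLLM: "auto" }),
//   );
```

The result shape mirrors `Promise.allSettled`, so successful and failed documents can be separated by checking `status`.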
### Document Q&A and Data Extraction
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

// Ask multiple questions about the same document
const questions = [
  "What is the main hypothesis?",
  "What methodology was used?",
  "What are the key findings?",
  "What are the limitations mentioned?",
];
const documentUrl = "https://example.com/research-paper.pdf";

const answers = await Promise.all(
  questions.map((prompt) => client.ask({ url: documentUrl, prompt })),
);
answers.forEach((result, index) => {
  console.log(`\nQuestion: ${questions[index]}`);
  console.log(`Answer: ${result.markdown}`);
});

// Extract structured data using the extract endpoint
const structuredData = await client.extract({
  url: documentUrl,
  prompt: "Extract comprehensive research information from this paper",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      authors: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            affiliation: { type: "string" },
          },
          additionalProperties: false,
        },
      },
      abstract: { type: "string" },
      methodology: {
        type: "object",
        properties: {
          approach: { type: "string" },
          dataCollection: { type: "string" },
          sampleSize: { type: "number" },
        },
        additionalProperties: false,
      },
      findings: {
        type: "array",
        items: { type: "string" },
      },
      limitations: {
        type: "array",
        items: { type: "string" },
      },
      conclusions: { type: "string" },
    },
    required: ["title", "abstract", "findings"],
    additionalProperties: false,
  },
});
console.log(
  "Structured Research Data:",
  JSON.stringify(structuredData.data, null, 2),
);

// Note: Each ask and extract call is billed separately, based on page count
const totalCredits =
  answers.reduce((sum, result) => sum + result.creditCount, 0) +
  structuredData.creditCount;
console.log(`\nTotal credits used: ${totalCredits}`);
```
### Academic Search with Pagination
```typescript
import { PDFVector, type AcademicSearchPublication } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const query = "climate change";
const limit = 50;
const maxResults = 200;
const allResults: AcademicSearchPublication[] = [];
let offset = 0;

// Fetch pages until enough results are collected or no more are available
while (allResults.length < maxResults) {
  const response = await client.academicSearch({
    query,
    providers: ["semantic-scholar", "arxiv"],
    offset,
    limit,
  });
  allResults.push(...response.results);

  // Stop on an empty page; otherwise the loop would never end if
  // estimatedTotalResults overstates what is actually available
  if (
    response.results.length === 0 ||
    allResults.length >= response.estimatedTotalResults
  ) {
    break;
  }
  offset += limit;
}

console.log(`Fetched ${allResults.length} publications`);
```
### Custom Base URL
```typescript
// For development or custom deployments
const client = new PDFVector({
  apiKey: "pdfvector_api_key_here",
  baseUrl: "https://pdfvector.acme.com",
});
```
## Support
- **API Reference**: [pdfvector.com/v1/api/scalar](https://www.pdfvector.com/v1/api/scalar)
- **Dashboard**: [pdfvector.com/dashboard](https://www.pdfvector.com/dashboard)
## License
This SDK is licensed under the MIT License.