Official TypeScript/JavaScript SDK for PDF Vector API - Parse PDFs to markdown and search academic publications across multiple databases
# PDF Vector TypeScript/JavaScript SDK
The official TypeScript/JavaScript SDK for the PDF Vector API. It converts PDF and Word documents to clean, structured markdown with optional AI enhancement, searches multiple academic databases through a unified API, and fetches specific publications by DOI, PubMed ID, ArXiv ID, and more.
## Installation
```bash
npm install pdfvector
# or
yarn add pdfvector
# or
pnpm add pdfvector
# or
bun add pdfvector
```
## Quick Start
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

// Parse from a document URL or from raw data
const parseResult = await client.parse({
  url: "https://example.com/document.pdf",
  useLLM: "auto",
});

console.log(parseResult.markdown); // Clean markdown output
console.log(
  `Pages: ${parseResult.pageCount}, Credits: ${parseResult.creditCount}`,
);
```
## Authentication
Get your API key from the [PDF Vector dashboard](https://www.pdfvector.com/api-keys). The SDK requires a valid API key for all operations.
```typescript
const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });
```
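To keep the key out of source control, it can be read from the environment instead of being hard-coded. The helper below is a minimal sketch: `requireApiKey` and the variable name `PDFVECTOR_API_KEY` are hypothetical choices for illustration, not part of the SDK.

```typescript
// Hypothetical helper: the SDK does not read the environment on its own,
// and the variable name PDFVECTOR_API_KEY is an assumption.
function requireApiKey(
  env: Record<string, string | undefined>,
  name: string = "PDFVECTOR_API_KEY",
): string {
  const key = env[name]?.trim();
  if (!key) {
    // Fail early with a clear message instead of sending an empty key
    throw new Error(`Missing ${name}: set it before creating the client`);
  }
  return key;
}

// Usage in Node.js (not executed here):
// const client = new PDFVector({ apiKey: requireApiKey(process.env) });
```

Passing the environment in as an argument keeps the helper easy to test and avoids a direct dependency on `process`.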
## Usage Examples
### Parse from URL
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  url: "https://arxiv.org/pdf/2301.00001.pdf",
  useLLM: "auto",
});

console.log(result.markdown);
```
### Parse from data
```typescript
import { readFile } from "fs/promises";
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const result = await client.parse({
  data: await readFile("document.pdf"),
  contentType: "application/pdf",
  useLLM: "auto",
});

console.log(result.markdown);
```
### Search academic publications
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const searchResponse = await client.academicSearch({
  query: "quantum computing",
  providers: ["semantic-scholar", "arxiv", "pubmed"], // Search across multiple academic databases
  limit: 20,
  yearFrom: 2021,
  yearTo: 2024,
});

searchResponse.results.forEach((publication) => {
  console.log(`Title: ${publication.title}`);
  console.log(`Authors: ${publication.authors?.map((a) => a.name).join(", ")}`);
  console.log(`Year: ${publication.year}`);
  console.log(`Abstract: ${publication.abstract}`);
  console.log("---");
});
```
### Search with Provider-Specific Data
```typescript
const searchResponse = await client.academicSearch({
  query: "CRISPR gene editing",
  providers: ["semantic-scholar"],
  fields: ["title", "authors", "year", "providerData"], // providerData holds provider-specific metadata
});

searchResponse.results.forEach((pub) => {
  if (pub.provider === "semantic-scholar" && pub.providerData) {
    const data = pub.providerData;
    console.log(`Influential Citations: ${data.influentialCitationCount}`);
    console.log(`Fields of Study: ${data.fieldsOfStudy?.join(", ")}`);
  }
});
```
### Fetch Academic Publications by ID
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const response = await client.academicFetch({
  ids: [
    "10.1038/nature12373", // DOI
    "12345678", // PubMed ID
    "2301.00001", // ArXiv ID
    "arXiv:2507.16298v1", // ArXiv with prefix
    "ED123456", // ERIC ID
    "0f40b1f08821e22e859c6050916cec3667778613", // Semantic Scholar ID
  ],
  fields: ["title", "authors", "year", "abstract", "doi"], // Optional: specify fields
});

// Handle successful results
response.results.forEach((pub) => {
  console.log(`Title: ${pub.title}`);
  console.log(`Provider: ${pub.detectedProvider}`);
  console.log(`Requested as: ${pub.id}`);
});

// Handle errors for IDs that couldn't be fetched
response.errors?.forEach((error) => {
  console.log(`Failed to fetch ${error.id}: ${error.error}`);
});
```
### Error Handling
```typescript
import { PDFVector, PDFVectorError } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

try {
  const result = await client.parse({
    url: "https://example.com/document.pdf",
  });
  console.log(result.markdown);
} catch (error) {
  if (error instanceof PDFVectorError) {
    console.error(`API Error: ${error.message}`);
    console.error(`Status: ${error.status}`);
    console.error(`Code: ${error.code}`);
  } else {
    console.error("Unexpected Error:", error);
  }
}
```
## API Reference
`PDFVector` is the client class for interacting with the PDF Vector API.
### Constructor
```typescript
new PDFVector(config: PDFVectorConfig)
```
**Parameters:**
- `config.apiKey` (string): Your PDF Vector API key
- `config.baseUrl` (string, optional): Custom base URL (defaults to `https://www.pdfvector.com`)
### Methods
#### `parse(request)`
Parse a PDF or Word document and convert it to markdown.
**Parameters:**
For URL parsing:
```typescript
{
  url: string; // Direct URL to PDF/Word document
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```
For data parsing:
```typescript
{
  data: string | Buffer | Uint8Array | ArrayBuffer | Blob | ReadableStream; // Direct data of PDF/Word document
  contentType: string; // MIME type (e.g., 'application/pdf')
  useLLM?: 'auto' | 'always' | 'never'; // Default: 'auto'
}
```
**Returns:**
```typescript
{
  markdown: string; // Extracted content as markdown
  pageCount: number; // Number of pages processed
  creditCount: number; // Credits consumed (1-2 per page)
  usedLLM: boolean; // Whether AI enhancement was used
}
```
#### LLM Usage Options
- **`auto`** (default): Automatically decide if AI enhancement is needed (1-2 credits per page)
- **`never`**: Standard parsing without AI (1 credit per page)
- **`always`**: Force AI enhancement (2 credits per page)
**Note:** Free plans are limited to `useLLM: 'never'`. Upgrade to a paid plan for AI enhancement.
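The per-page rates above are enough to estimate a document's cost before parsing it. The following is a rough sketch: `estimateParseCredits` is a hypothetical helper, not an SDK function, and the actual charge is whatever `creditCount` reports in the response.

```typescript
type UseLLM = "auto" | "always" | "never";

interface CreditRange {
  min: number;
  max: number;
}

// Per-page credit bounds, taken from the LLM usage options listed above
const CREDITS_PER_PAGE: Record<UseLLM, CreditRange> = {
  never: { min: 1, max: 1 },
  always: { min: 2, max: 2 },
  auto: { min: 1, max: 2 },
};

// Hypothetical helper: returns the lowest and highest possible credit cost
function estimateParseCredits(
  pageCount: number,
  useLLM: UseLLM = "auto",
): CreditRange {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  const perPage = CREDITS_PER_PAGE[useLLM];
  return { min: pageCount * perPage.min, max: pageCount * perPage.max };
}

console.log(estimateParseCredits(12, "auto")); // { min: 12, max: 24 }
```

With `auto`, the result is a range because the service decides per document whether AI enhancement is needed.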
#### Supported File Types
##### PDF Documents
- `application/pdf`
- `application/x-pdf`
- `application/acrobat`
- `application/vnd.pdf`
- `text/pdf`
- `text/x-pdf`
##### Word Documents
- `application/msword` (.doc)
- `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (.docx)
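When parsing from data, the `contentType` has to be supplied by the caller. One possible way to derive it from a file name is sketched below. `contentTypeForFile` is a hypothetical helper, and it maps only the three extensions named above to one canonical MIME type each.

```typescript
// Extension-to-MIME map built from the supported types listed above
const CONTENT_TYPES = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["doc", "application/msword"],
  [
    "docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ],
]);

// Hypothetical helper: returns undefined for unsupported or missing extensions
function contentTypeForFile(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  const extension = filename.slice(dot + 1).toLowerCase();
  return CONTENT_TYPES.get(extension);
}

console.log(contentTypeForFile("Report.PDF")); // application/pdf
console.log(contentTypeForFile("notes.txt")); // undefined
```

Checking the result for `undefined` before calling `parse` avoids spending a request on a file that would be rejected as `unsupported-content-type`.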
#### Usage Limits
- **Processing timeout**: 3 minutes per document
- **File size**: No explicit limit, but larger files usually have more pages and consume more credits
#### Cost
- **Credits**: Consumed per page (1-2 credits depending on LLM usage)
#### Common Error Codes
- `url-not-found`: Document URL not accessible
- `unsupported-content-type`: File type not supported
- `timeout-error`: Processing timeout (3 minutes max)
- `payment-required`: Usage limit reached
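These codes can be mapped to a message and a retry hint in application code. The sketch below assumes that only `timeout-error` is worth retrying. That `retryable` flag is an assumption made for illustration, not documented API behavior, and `adviseOnParseError` is a hypothetical helper.

```typescript
interface ErrorAdvice {
  message: string;
  retryable: boolean; // Assumption: a hint for callers, not an API guarantee
}

// Messages come from the list above; the retryable flags are assumptions
const PARSE_ERROR_ADVICE = new Map<string, ErrorAdvice>([
  ["url-not-found", { message: "Document URL not accessible", retryable: false }],
  ["unsupported-content-type", { message: "File type not supported", retryable: false }],
  ["timeout-error", { message: "Processing timeout (3 minutes max)", retryable: true }],
  ["payment-required", { message: "Usage limit reached", retryable: false }],
]);

// Hypothetical helper: looks up advice for a code such as PDFVectorError.code
function adviseOnParseError(code: string | undefined): ErrorAdvice {
  const known = code === undefined ? undefined : PARSE_ERROR_ADVICE.get(code);
  return (
    known ?? {
      message: `Unrecognized error code: ${code ?? "(none)"}`,
      retryable: false,
    }
  );
}

console.log(adviseOnParseError("timeout-error").retryable); // true
```

Inside a `catch` block, the `error.code` of a `PDFVectorError` could be passed straight to this function.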
#### `academicSearch(request)`
Search academic publications across multiple databases.
**Parameters:**
```typescript
{
  query: string; // Search query
  providers?: AcademicSearchProvider[]; // Databases to search (default: ["semantic-scholar"])
  offset?: number; // Pagination offset (default: 0)
  limit?: number; // Results per page, 1-100 (default: 20)
  yearFrom?: number; // Filter by publication year (from) (min: 1900)
  yearTo?: number; // Filter by publication year (to) (max: 2050)
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```
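The documented ranges can be checked locally before a request is sent. This is a minimal sketch with a hypothetical `validateSearchParams` helper. The `limit`, `yearFrom`, and `yearTo` bounds come from the parameter comments above, while the non-empty query, non-negative offset, and `yearFrom <= yearTo` checks are assumptions.

```typescript
interface SearchParams {
  query: string;
  offset?: number;
  limit?: number;
  yearFrom?: number;
  yearTo?: number;
}

// Hypothetical helper: returns a list of problems, empty when the input looks valid
function validateSearchParams(params: SearchParams): string[] {
  const problems: string[] = [];
  const { query, offset, limit, yearFrom, yearTo } = params;

  // Assumption: an empty query is never useful
  if (query.trim() === "") problems.push("query must not be empty");
  // Assumption: offsets are non-negative integers
  if (offset !== undefined && (!Number.isInteger(offset) || offset < 0)) {
    problems.push("offset must be a non-negative integer");
  }
  // Documented: 1-100
  if (limit !== undefined && (!Number.isInteger(limit) || limit < 1 || limit > 100)) {
    problems.push("limit must be an integer from 1 to 100");
  }
  // Documented: min 1900 and max 2050
  if (yearFrom !== undefined && yearFrom < 1900) problems.push("yearFrom must be 1900 or later");
  if (yearTo !== undefined && yearTo > 2050) problems.push("yearTo must be 2050 or earlier");
  // Assumption: the range must not be inverted
  if (yearFrom !== undefined && yearTo !== undefined && yearFrom > yearTo) {
    problems.push("yearFrom must not be after yearTo");
  }
  return problems;
}

console.log(validateSearchParams({ query: "quantum computing", limit: 500 }));
// [ 'limit must be an integer from 1 to 100' ]
```

Returning all problems at once, rather than throwing on the first, makes it easier to report every issue with a request in one go.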
**Supported Providers:**
- `"semantic-scholar"` - [Semantic Scholar](https://www.semanticscholar.org/)
- `"arxiv"` - [ArXiv](https://arxiv.org/)
- `"pubmed"` - [PubMed](https://pubmed.ncbi.nlm.nih.gov/)
- `"google-scholar"` - [Google Scholar](https://scholar.google.com/)
- `"eric"` - [ERIC](https://eric.ed.gov/)
**Available Fields:**
- Basic fields: `"id"`, `"doi"`, `"title"`, `"url"`, `"providerURL"`, `"authors"`, `"date"`, `"year"`, `"totalCitations"`, `"totalReferences"`, `"abstract"`, `"pdfURL"`, `"provider"`
- Extended field: `"providerData"` - Provider-specific metadata
**Returns:**
```typescript
{
  estimatedTotalResults: number; // Total results available
  results: AcademicSearchPublication[]; // Array of publications
  errors?: AcademicSearchProviderError[]; // Any provider errors
}
```
#### Cost
- **Credits**: 2 credits per search.
#### `academicFetch(request)` / `fetch(request)`
Fetch specific academic publications by their IDs with automatic provider detection.
**Parameters:**
```typescript
{
  ids: string[]; // Array of publication IDs to fetch
  fields?: AcademicSearchPublicationField[]; // Fields to include in response
}
```
**Supported ID Types:**
- **DOI**: e.g., `"10.1038/nature12373"`
- **PubMed ID**: e.g., `"12345678"` (numeric ID)
- **ArXiv ID**: e.g., `"2301.00001"` or `"arXiv:2301.00001"` or `"math.GT/0309136"`
- **Semantic Scholar ID**: e.g., `"0f40b1f08821e22e859c6050916cec3667778613"`
- **ERIC ID**: e.g., `"ED123456"`
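The API detects the provider on its own, but a client-side guess can catch obviously malformed IDs before a request is made. The patterns below are an illustrative approximation of the ID shapes listed above, not the service's actual detection rules, and `guessIdKind` is a hypothetical helper.

```typescript
type IdKind =
  | "doi"
  | "arxiv"
  | "semantic-scholar"
  | "eric"
  | "pubmed"
  | "unknown";

// Assumed patterns, checked in order; the first match wins
const ID_PATTERNS: ReadonlyArray<[IdKind, RegExp]> = [
  ["doi", /^10\.\d{4,9}\/\S+$/],
  ["arxiv", /^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$/i], // e.g. 2301.00001, arXiv:2507.16298v1
  ["arxiv", /^(arxiv:)?[a-z-]+(\.[a-z]{2})?\/\d{7}(v\d+)?$/i], // e.g. math.GT/0309136
  ["semantic-scholar", /^[0-9a-f]{40}$/i], // 40-character hex ID
  ["eric", /^E[DJ]\d+$/i], // e.g. ED123456
  ["pubmed", /^\d+$/], // purely numeric
];

// Hypothetical helper: best-effort classification of a publication ID
function guessIdKind(id: string): IdKind {
  const trimmed = id.trim();
  for (const [kind, pattern] of ID_PATTERNS) {
    if (pattern.test(trimmed)) return kind;
  }
  return "unknown";
}

console.log(guessIdKind("10.1038/nature12373")); // doi
console.log(guessIdKind("math.GT/0309136")); // arxiv
```

Order matters here: the purely numeric PubMed pattern is checked last so that it only catches IDs that no more specific pattern claims.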
**Returns:**
```typescript
{
  results: AcademicFetchResult[]; // Successfully fetched publications
  errors?: AcademicFetchError[]; // Errors for IDs that couldn't be fetched
}
```
Each result includes:
```typescript
{
  id: string; // The ID that was used to fetch
  detectedProvider: string; // Provider that was used
  // ... all publication fields (title, authors, abstract, etc.)
}
```
#### Cost
- **Credits**: 2 credits per fetch.
## TypeScript Support
The SDK is written in TypeScript and includes full type definitions:
```typescript
import type {
  // Core classes
  PDFVector,
  PDFVectorConfig,
  PDFVectorError,
  // Parse API types
  ParseURLRequest,
  ParseDataRequest,
  ParseResponse,
  // Academic Search API types
  SearchRequest,
  AcademicSearchResponse,
  AcademicSearchPublication,
  AcademicSearchProvider,
  AcademicSearchAuthor,
  AcademicSearchPublicationField,
  // Academic Fetch API types
  FetchRequest,
  AcademicFetchResponse,
  AcademicFetchResult,
  AcademicFetchError,
  // Provider-specific data types
  AcademicSearchSemanticScholarData,
  AcademicSearchGoogleScholarData,
  AcademicSearchPubMedData,
  AcademicSearchArxivData,
  AcademicSearchEricData,
} from "pdfvector";

// Constants
import {
  AcademicSearchProviderValues, // Array of valid providers
  AcademicSearchPublicationFieldValues, // Array of valid fields
} from "pdfvector";
```
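A value array such as `AcademicSearchProviderValues` is typically useful for narrowing untrusted strings (for example CLI arguments) to the provider type. The sketch below uses a local stand-in array so that it runs without the SDK. `PROVIDER_VALUES`, `isProvider`, and `parseProviderList` are hypothetical names, and the SDK export is assumed to be a plain array of strings.

```typescript
// Local stand-in for the SDK's AcademicSearchProviderValues (assumed shape)
const PROVIDER_VALUES = [
  "semantic-scholar",
  "arxiv",
  "pubmed",
  "google-scholar",
  "eric",
] as const;

type Provider = (typeof PROVIDER_VALUES)[number];

// Type guard: narrows an arbitrary string to a known provider
function isProvider(value: string): value is Provider {
  return (PROVIDER_VALUES as readonly string[]).includes(value);
}

// Hypothetical helper: parses "arxiv, pubmed" style input and drops unknown names
function parseProviderList(csv: string): Provider[] {
  return csv
    .split(",")
    .map((name) => name.trim())
    .filter(isProvider);
}

console.log(parseProviderList("arxiv, pubmed, bogus")); // [ 'arxiv', 'pubmed' ]
```

Because `isProvider` is a type predicate, the filtered array is typed as `Provider[]` and can be passed to a `providers` option without a cast.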
## Node.js Support
- **Node.js version**: Node.js 20+
- **ESM**: Supports ES modules (CommonJS is not supported)
- **Dependencies**: Uses standard `fetch` API
## Examples
### Batch Processing
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

const documents = [
  "https://example.com/doc1.pdf",
  "https://example.com/doc2.pdf",
];

// Parse all documents concurrently; Promise.all rejects if any single parse fails
const results = await Promise.all(
  documents.map((url) => client.parse({ url, useLLM: "auto" })),
);

results.forEach((result, index) => {
  console.log(`Document ${index + 1}:`);
  console.log(`Pages: ${result.pageCount}`);
  console.log(`Credits: ${result.creditCount}`);
});
```
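`Promise.all` starts every request at once, which may be undesirable for long document lists. One common approach, sketched here, is to split the list into fixed-size batches and await each batch in turn. `chunk` is a hypothetical helper, and the batch size of 5 in the usage comment is an arbitrary choice, not a documented limit.

```typescript
// Hypothetical helper: splits a list into consecutive batches of at most `size` items
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

console.log(chunk(["a", "b", "c", "d", "e"], 2));
// [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e' ] ]

// Usage with a PDFVector client (not executed here):
// for (const batch of chunk(documents, 5)) {
//   const results = await Promise.all(
//     batch.map((url) => client.parse({ url, useLLM: "auto" })),
//   );
//   // ...handle results before starting the next batch
// }
```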
### Academic Search with Pagination
```typescript
import { PDFVector } from "pdfvector";

const client = new PDFVector({ apiKey: "pdfvector_api_key_here" });

let offset = 0;
const limit = 50;
const allResults = [];

// Fetch first page
let response = await client.academicSearch({
  query: "climate change",
  providers: ["semantic-scholar", "arxiv"],
  offset,
  limit,
});
allResults.push(...response.results);

// Fetch more pages as needed, up to 200 results
while (
  allResults.length < response.estimatedTotalResults &&
  allResults.length < 200
) {
  offset += limit;
  response = await client.academicSearch({
    query: "climate change",
    providers: ["semantic-scholar", "arxiv"],
    offset,
    limit,
  });
  // Stop on an empty page: the total is only an estimate, so the loop
  // could otherwise run forever
  if (response.results.length === 0) break;
  allResults.push(...response.results);
}

console.log(`Fetched ${allResults.length} publications`);
```
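When several providers are queried, the same paper may plausibly appear more than once in the combined results. The sketch below removes such duplicates, keyed by DOI (compared case-insensitively) with a fallback to the normalized title. Whether duplicates actually occur is an assumption, and `dedupePublications` is a hypothetical helper rather than an SDK function.

```typescript
// Minimal shape needed for deduplication; real results carry more fields
interface PublicationLike {
  doi?: string | null;
  title?: string | null;
}

// Hypothetical helper: keeps the first occurrence of each DOI or title
function dedupePublications<T extends PublicationLike>(
  publications: readonly T[],
): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const publication of publications) {
    const key = publication.doi
      ? `doi:${publication.doi.toLowerCase()}`
      : publication.title
        ? `title:${publication.title.trim().toLowerCase()}`
        : undefined;
    if (key !== undefined) {
      if (seen.has(key)) continue; // Already kept an earlier copy
      seen.add(key);
    }
    unique.push(publication); // Entries with no DOI and no title are always kept
  }
  return unique;
}

const sample = [
  { doi: "10.1038/NATURE12373", title: "Paper A" },
  { doi: "10.1038/nature12373", title: "Paper A (preprint)" },
  { title: "Paper B" },
  { title: " paper b " },
];
console.log(dedupePublications(sample).length); // 2
```

Title matching is deliberately crude and can merge distinct papers that share a title, so the DOI is preferred whenever it is present.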
### Custom Base URL
```typescript
// For development or custom deployments
const client = new PDFVector({
  apiKey: "pdfvector_api_key_here",
  baseUrl: "https://pdfvector.acme.com",
});
```
## Support
- **API Reference (Scalar)**: [pdfvector.com/v1/api/scalar](https://www.pdfvector.com/v1/api/scalar)
- **API Reference (Swagger)**: [pdfvector.com/v1/api/swagger](https://www.pdfvector.com/v1/api/swagger)
- **Dashboard**: [pdfvector.com/dashboard](https://www.pdfvector.com/dashboard)
## License
This SDK is licensed under the MIT License.