# GQL Ingest

[![npm version](https://badge.fury.io/js/%40jackchuka%2Fgql-ingest.svg)](https://badge.fury.io/js/%40jackchuka%2Fgql-ingest) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A TypeScript library and CLI tool that reads data from multiple formats (CSV, JSON, YAML, JSONL) and ingests it into GraphQL APIs through configurable mutations.

## Features

- ✅ **Supported data formats**: CSV, JSON, YAML, JSONL
- ✅ **Complex nested data support** for sophisticated GraphQL mutations
- ✅ External GraphQL mutation definitions (separate .graphql files)
- ✅ Flexible data-to-GraphQL variable mapping via JSON configuration
- ✅ Configurable GraphQL endpoint and headers
- ✅ **Parallel processing** with dependency management
- ✅ Entity-level and row-level concurrency control
- ✅ **Retry capabilities** with exponential backoff and configurable error handling
- ✅ Comprehensive metrics and progress tracking
- ✅ **Event-based progress monitoring** with real-time callbacks
- ✅ **Cancellation support** via the AbortController pattern

## Installation

### For End Users

```bash
# Install globally
npm install -g @jackchuka/gql-ingest

# Or use with npx (no installation required)
npx @jackchuka/gql-ingest --endpoint <url> --config <path>
```

### For Development

```bash
git clone https://github.com/jackchuka/gql-ingest.git
cd gql-ingest
pnpm install
pnpm run build
```

## Quick Start

Initialize a new configuration and start ingesting data in minutes:

```bash
# Create a new configuration directory
gql-ingest init ./my-config

# Add a new entity
gql-ingest add users -p ./my-config -f json --fields "id,name,email"

# Run ingestion
gql-ingest -e https://your-api.com/graphql -c ./my-config
```

## Usage

### CLI Commands

#### Initialize Configuration

Create a new configuration directory with example files:

```bash
gql-ingest init [path] [options]

Options:
  --no-example       Skip creating example entity files
  --no-config        Skip creating config.yaml
  -f, --force        Overwrite existing files
  -q, --quiet        Suppress output
```

This creates:

- `data/` - Data files directory
- `graphql/` - GraphQL mutation files
- `mappings/` - Mapping configuration files
- `config.yaml` - Processing configuration
- Example entity files (by default)

#### Add Entity

Add a new entity to an existing configuration:

```bash
gql-ingest add <entity-name> [options]

Options:
  -p, --path <path>       Config directory path (default: current directory)
  -f, --format <format>   Data format (csv, json, yaml, jsonl)
  --fields <fields>       Comma-separated field names
  --mutation <name>       GraphQL mutation name
  --no-interactive        Skip prompts, use defaults only
  -q, --quiet             Suppress output
```

Interactive mode prompts for format, fields, and mutation name. Use `--no-interactive` with flags for CI/CD.

#### Run Ingestion

Ingest data from a configuration into a GraphQL API:

```bash
gql-ingest [options]

Options:
  -e, --endpoint <url>      GraphQL endpoint URL (required)
  -c, --config <path>       Path to configuration directory (required)
  -n, --entities <list>     Comma-separated list of entities to process
  -h, --headers <headers>   JSON string of headers
  -f, --format <format>     Override data format detection
  -q, --quiet               Suppress output
```

### CLI Examples

```bash
# Basic usage
gql-ingest \
  -e https://your-graphql-api.com/graphql \
  -c ./examples/demo

# With authentication headers
gql-ingest \
  -e https://your-graphql-api.com/graphql \
  -c ./examples/demo \
  -h '{"Authorization": "Bearer YOUR_TOKEN"}'

# Process specific entities only
gql-ingest \
  -e https://your-graphql-api.com/graphql \
  -c ./examples/demo \
  -n users,products
```

### Programmatic API

GQL Ingest provides a full programmatic API for integration into your Node.js applications.
#### Installation for API Usage

```bash
npm install @jackchuka/gql-ingest
```

#### Basic API Usage

```typescript
import { GQLIngest, createConsoleLogger } from "@jackchuka/gql-ingest";

// Initialize the client
const client = new GQLIngest({
  endpoint: "https://your-graphql-api.com/graphql",
  headers: {
    Authorization: "Bearer YOUR_TOKEN",
  },
  logger: createConsoleLogger({ prefix: "my-app" }), // Optional: enable logging with prefix
});

// Ingest all data from a configuration
const result = await client.ingest("./config");

// Check whether ingestion was successful
if (result.success) {
  console.log("Ingestion completed successfully");
  console.log("Metrics:", result.metrics);
} else {
  console.error("Ingestion failed:", result.errors);
}
```

#### Processing Specific Entities

```typescript
// Process only specific entities
const result = await client.ingestEntities("./config", ["users", "products"]);

// Or use the ingest method with options
const filtered = await client.ingest("./config", {
  entities: ["users", "products"],
  format: "csv", // Optional: override format detection
});
```

#### Advanced API Usage

For more control, you can access the underlying components directly:

```typescript
import {
  GraphQLClientWrapper,
  DataMapper,
  DependencyResolver,
  MetricsCollector,
  loadConfig,
  createConsoleLogger,
} from "@jackchuka/gql-ingest";

// Create your own custom workflow
const logger = createConsoleLogger();
const metrics = new MetricsCollector();
const client = new GraphQLClientWrapper(endpoint, headers, metrics, logger);
const mapper = new DataMapper(client, basePath, metrics, logger);

// Load configuration
const config = loadConfig("./config");

// Process entities with custom logic
// ... your custom implementation
```

#### API Methods

**GQLIngest Class Methods:**

- `constructor(options: GQLIngestOptions)` - Initialize the client
- `ingest(configPath: string, options?: IngestOptions)` - Ingest data from a configuration
- `ingestEntities(configPath: string, entities: string[])` - Process specific entities
- `getMetrics()` - Get current processing metrics
- `getMetricsSummary()` - Get formatted metrics summary
- `setLogger(logger: Logger)` - Set custom logger
- `setHeaders(headers: Record<string, string>)` - Update request headers
- `cancel(reason?: string)` - Cancel in-progress ingestion
- `processing` - Property indicating whether ingestion is in progress

#### Event-Based Progress Monitoring

GQLIngest extends EventEmitter, enabling real-time progress tracking and cancellation:

```typescript
import { GQLIngest } from "@jackchuka/gql-ingest";

const client = new GQLIngest({
  endpoint: "https://your-api.com/graphql",
  eventOptions: {
    emitRowEvents: true, // Emit events for each row
    emitProgressEvents: true, // Emit periodic progress
    progressInterval: 1000, // Progress every 1 second
  },
});

// Listen for events
client.on("started", (p) => console.log(`Starting ${p.totalEntities} entities`));
client.on("progress", (p) => console.log(`${p.progressPercent.toFixed(1)}% complete`));
client.on("entityStart", (p) => console.log(`Processing ${p.entityName}`));
client.on("entityComplete", (p) =>
  console.log(`${p.entityName}: ${p.metrics.successfulRows} rows`),
);
client.on("rowSuccess", (p) => console.log(`Row ${p.rowIndex} OK`));
client.on("rowFailure", (p) => console.error(`Row ${p.rowIndex} failed: ${p.error.message}`));
client.on("finished", (p) => console.log(`Done in ${p.durationMs}ms`));
client.on("errored", (p) => console.error(`Error: ${p.error.message}`));
client.on("cancelled", (p) => console.log(`Cancelled: ${p.reason}`));

// Handle graceful shutdown
process.on("SIGINT", () => client.cancel("User interrupted"));

await client.ingest("./config");
```
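The row-level events also make it easy to build your own failure report for later inspection or re-ingestion. A minimal sketch, using a plain Node `EventEmitter` as a stand-in for a `GQLIngest` instance — only the `rowFailure` event name and its `entityName`/`rowIndex`/`error` payload fields come from the library; the helper itself is illustrative:

```typescript
import { EventEmitter } from "node:events";

interface RowFailurePayload {
  entityName: string;
  rowIndex: number;
  error: Error;
}

// Group failed rows by entity so they can be reviewed after the run.
function collectFailures(emitter: EventEmitter): Map<string, RowFailurePayload[]> {
  const failures = new Map<string, RowFailurePayload[]>();
  emitter.on("rowFailure", (p: RowFailurePayload) => {
    const list = failures.get(p.entityName) ?? [];
    list.push(p);
    failures.set(p.entityName, list);
  });
  return failures;
}

// Demo with a stand-in emitter
const client = new EventEmitter();
const failures = collectFailures(client);
client.emit("rowFailure", { entityName: "users", rowIndex: 3, error: new Error("duplicate email") });
client.emit("rowFailure", { entityName: "users", rowIndex: 7, error: new Error("invalid id") });
console.log(failures.get("users")?.map((f) => f.rowIndex));
// → [ 3, 7 ]
```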
**Available Events:**

| Event            | When Emitted             | Key Payload Fields                                |
| ---------------- | ------------------------ | ------------------------------------------------- |
| `started`        | Ingestion begins         | `configPath`, `entityNames`, `totalWaves`         |
| `progress`       | Periodic interval        | `progressPercent`, `successfulRows`, `failedRows` |
| `entityStart`    | Entity processing begins | `entityName`, `totalRows`, `waveIndex`            |
| `entityComplete` | Entity processing ends   | `entityName`, `metrics`, `success`                |
| `rowSuccess`     | Row mutation succeeds    | `entityName`, `rowIndex`, `row`, `result`         |
| `rowFailure`     | Row mutation fails       | `entityName`, `rowIndex`, `error`                 |
| `cancelled`      | Processing cancelled     | `reason`, `metrics`, `elapsedMs`                  |
| `finished`       | Processing completes     | `metrics`, `durationMs`, `allSuccessful`          |
| `errored`        | Fatal error occurs       | `error`, `metrics`, `elapsedMs`                   |

#### Cancellation Support

Cancel in-progress ingestion using the `cancel()` method or an external AbortController:

```typescript
// Method 1: Using cancel()
const client = new GQLIngest({ endpoint: "..." });
process.on("SIGINT", () => client.cancel("User interrupted"));
await client.ingest("./config");

// Method 2: Using an external AbortController
const controller = new AbortController();
setTimeout(() => controller.abort("Timeout"), 60000);
await client.ingest("./config", { signal: controller.signal });
```

#### TypeScript Support

Full TypeScript support is included with comprehensive type definitions:

```typescript
import type {
  GQLIngestOptions,
  IngestOptions,
  IngestResult,
  ProcessingMetrics,
  EntityMetrics,
  // Event types
  EventOptions,
  StartedEventPayload,
  ProgressEventPayload,
  EntityStartEventPayload,
  EntityCompleteEventPayload,
  RowSuccessEventPayload,
  RowFailureEventPayload,
  CancelledEventPayload,
  FinishedEventPayload,
  ErroredEventPayload,
} from "@jackchuka/gql-ingest";
```

## Parallel Processing 🚀

GQL Ingest supports advanced parallel processing with dependency management for high-performance data ingestion:

### Key Capabilities

- **Entity-level parallelism**: Process multiple entities (users, products, orders) concurrently
- **Row-level parallelism**: Process multiple CSV rows within an entity concurrently
- **Dependency management**: Ensure entities process in the correct order (e.g., users before orders)
- **Smart batching**: Control exactly how many entities/rows process simultaneously
- **Real-time metrics**: Track progress, success rates, and performance

### Quick Example

```yaml
# config.yaml - Add to your configuration directory
parallelProcessing:
  concurrency: 10 # Process up to 10 CSV rows per entity concurrently
  entityConcurrency: 3 # Process up to 3 entities simultaneously
  preserveRowOrder: false # Allow rows to complete out of order for speed

# Define dependencies between entities
entityDependencies:
  products: ["users"] # Products must wait for users to complete
  orders: ["products"] # Orders must wait for products to complete
```

**Performance Impact**: This configuration can process data **10-50x faster** than sequential processing, depending on your GraphQL API's capabilities.

👉 **[Full Parallel Processing Guide](PARALLEL_PROCESSING.md)** - Detailed configuration options, performance tuning, and examples.

## Retry Capabilities 🔄

GQL Ingest includes robust retry functionality to handle transient failures and improve reliability:

### Key Features

- **Automatic retries**: Failed GraphQL mutations are retried automatically
- **Exponential backoff**: Delays increase intelligently between retry attempts
- **Jitter**: Randomization prevents thundering-herd problems
- **Configurable error codes**: Control which HTTP status codes trigger retries
- **Per-entity overrides**: Different retry settings for different entities
- **Metrics tracking**: Monitor retry success rates and attempt counts

### Quick Example

```yaml
# config.yaml - Add to your configuration directory
retry:
  maxAttempts: 5 # Retry up to 5 times (default: 3)
  baseDelay: 2000 # Start with 2s delay (default: 1000ms)
  maxDelay: 60000 # Cap delays at 60s (default: 30000ms)
  exponentialBackoff: true # Double delay each retry (default: true)
  retryableStatusCodes: # Which HTTP errors to retry (defaults shown)
    - 408 # Request Timeout
    - 429 # Too Many Requests
    - 500 # Internal Server Error
    - 502 # Bad Gateway
    - 503 # Service Unavailable
    - 504 # Gateway Timeout

# Per-entity retry overrides
entityConfig:
  critical-orders:
    retry:
      maxAttempts: 10 # More retries for critical data
      baseDelay: 500 # Faster initial retry
```

**Reliability Impact**: Retry capabilities can improve success rates from 95% to 99.9%+ for APIs with transient failures.
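The retry schedule implied by these settings can be sketched as follows. This mirrors the documented semantics of `baseDelay`, `maxDelay`, and `exponentialBackoff`, not the library's internal code, and the full-jitter variant shown is an assumption:

```typescript
// Sketch of an exponential-backoff schedule with a delay cap and optional
// full jitter. Illustrative only; the library's actual formula may differ.
interface RetryPolicy {
  baseDelay: number; // ms before the first retry
  maxDelay: number; // upper bound on any delay
  exponentialBackoff: boolean; // double the delay on each attempt
  jitter?: boolean; // randomize to avoid thundering herds (assumed)
}

function retryDelay(attempt: number, policy: RetryPolicy): number {
  // attempt is 1-based: attempt 1 waits baseDelay, attempt 2 waits 2x, ...
  const factor = policy.exponentialBackoff ? 2 ** (attempt - 1) : 1;
  const capped = Math.min(policy.baseDelay * factor, policy.maxDelay);
  return policy.jitter ? Math.random() * capped : capped;
}

// With the library's documented defaults (1s base, 30s cap, backoff on):
const policy: RetryPolicy = { baseDelay: 1000, maxDelay: 30000, exponentialBackoff: true };
console.log([1, 2, 3, 4, 5, 6].map((a) => retryDelay(a, policy)));
// → [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

Note how the sixth attempt would wait 32s uncapped but is clamped to `maxDelay`; with jitter enabled each delay becomes a random value up to that bound.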
## Selective Entity Processing

The `--entities` flag lets you process specific entities instead of all discovered mappings:

- Process multiple entities: `--entities users,products,orders`
- Process a single entity: `--entities items`
- Entities are processed in dependency order automatically
- Missing dependencies trigger a warning but do not prevent execution

**Note**: When using `--entities` with entity dependencies defined in `config.yaml`, the tool will warn you about any missing dependencies but will still attempt to process the selected entities. Ensure dependent data exists in your GraphQL API before processing entities with unmet dependencies.

## Configuration

The `--config` flag points to a configuration directory containing these files:

- `data/` - Data files (CSV, JSON, YAML, or JSONL)
- `graphql/` - GraphQL mutation files
- `mappings/` - JSON files that map CSV columns to GraphQL variables
- `config.yaml` - _(Optional)_ Parallel processing and dependency configuration

Each entity has three corresponding files across these directories with matching names.

### Example Configuration

**examples/demo/mappings/items.json**:

```json
{
  "dataFile": "data/items.csv",
  "dataFormat": "csv",
  "graphqlFile": "graphql/items.graphql",
  "mapping": {
    "name": "item_name",
    "sku": "item_sku"
  }
}
```

**examples/demo/data/items.csv**:

```csv
item_name,item_sku
Item1,item-1-sku
Item2,item-2-sku
```

**examples/demo/graphql/items.graphql**:

```graphql
mutation CreateItem($name: String!, $sku: String!) {
  createItem(input: { name: $name, sku: $sku }) {
    id
    name
    sku
  }
}
```

**examples/demo/config.yaml** _(Optional - for parallel processing and retry configuration)_:

```yaml
# Parallel processing configuration
parallelProcessing:
  concurrency: 5 # Process 5 rows per entity concurrently
  entityConcurrency: 2 # Process 2 entities simultaneously
  preserveRowOrder: false # Allow faster out-of-order completion

# Global retry configuration
retry:
  maxAttempts: 3 # Retry failed requests up to 3 times
  baseDelay: 1000 # Start with 1s delay between retries
  exponentialBackoff: true # Double delay each retry

# Entity dependencies
entityDependencies:
  items: ["users"] # Items depend on users being processed first

# Per-entity overrides (optional)
entityConfig:
  users:
    retry:
      maxAttempts: 5 # More retries for user creation
  items:
    concurrency: 10 # Higher concurrency for items
```

## Supported Data Formats 📄

GQL Ingest supports multiple data formats beyond CSV for more flexible data ingestion, especially for complex nested GraphQL mutations:

### Supported Formats

- **CSV** - Traditional flat file format
- **JSON** - Perfect for nested/complex data structures
- **YAML** - Human-friendly alternative to JSON
- **JSONL** - JSON Lines format for streaming large datasets

### Format Selection

The tool automatically detects the format from the file extension, or you can specify it explicitly:

```bash
# Auto-detect from mapping configuration
gql-ingest --endpoint <url> --config ./config

# Force a specific format
gql-ingest --endpoint <url> --config ./config --format json
```

### JSON/YAML Format Examples

#### Direct Mapping (Entire Object)

For complex GraphQL mutations with nested input types, you can map the entire data object:

**data/products.json**:

```json
[
  {
    "name": "Premium T-Shirt",
    "type": "PHYSICAL",
    "options": [
      { "name": "Color", "values": ["Red", "Blue", "Green"] },
      { "name": "Size", "values": ["S", "M", "L", "XL"] }
    ],
    "variants": [
      {
        "name": "Red Small",
        "sku": "TS-RED-S",
        "optionMappings": [
          { "name": "Color", "value": "Red" },
          { "name": "Size", "value": "S" }
        ]
      }
    ]
  }
]
```

**mappings/products.json** (the `"$"` value maps the entire data object to the `input` variable):

```json
{
  "dataFile": "data/products.json",
  "dataFormat": "json",
  "graphqlFile": "graphql/newProduct.graphql",
  "mapping": {
    "input": "$"
  }
}
```

#### Path-Based Mapping

For transforming flat JSON into nested structures:

**data/products-flat.json**:

```json
[
  {
    "product_name": "Notebook",
    "product_type": "PHYSICAL",
    "brand": "ACME"
  }
]
```

**mappings/products-flat.json**:

```json
{
  "dataFile": "data/products-flat.json",
  "graphqlFile": "graphql/newProduct.graphql",
  "mapping": {
    "input": {
      "name": "$.product_name",
      "type": "$.product_type",
      "brandCode": "$.brand"
    }
  }
}
```

### YAML Format

YAML provides a more readable alternative:

**data/products.yaml**:

```yaml
- name: Premium T-Shirt
  type: PHYSICAL
  options:
    - name: Color
      values: [Red, Blue, Green]
    - name: Size
      values: [S, M, L, XL]
  variants:
    - name: Red Small
      sku: TS-RED-S
      optionMappings:
        - name: Color
          value: Red
        - name: Size
          value: S
```

## Development

### Scripts

```bash
pnpm run build        # Build CLI bundle with esbuild
pnpm run build:types  # Generate TypeScript declarations
pnpm run build:all    # Build bundle + types
pnpm run dev          # Run in development mode
pnpm run test         # Run test suite
```

## How It Works

1. **Discovery**: The tool scans the `mappings/` directory for `.json` files
2. **Dependency Resolution**: Analyzes `entityDependencies` to create execution waves
3. **Parallel Processing**: For each dependency wave:
   - Processes up to `entityConcurrency` entities simultaneously
   - Within each entity, processes up to `concurrency` rows concurrently
   - Waits for the entire wave to complete before starting the next wave
4. **GraphQL Execution**: For each row:
   - Loads the GraphQL mutation definition
   - Maps data fields to GraphQL variables using the mapping configuration
   - Executes the mutation against the GraphQL endpoint
5. **Error Handling & Retries**:
   - Failed mutations are automatically retried with exponential backoff
   - Non-retryable errors (e.g., validation failures) are logged and skipped
   - Retry policies are configurable per entity type
6. **Metrics & Monitoring**:
   - Real-time progress tracking and success/failure rates
   - Retry attempt counts and success rates
   - Detailed per-entity performance breakdown

## License

MIT