# CSV Utils

[![npm version](https://img.shields.io/npm/v/@doeixd/csv-utils.svg)](https://www.npmjs.com/package/@doeixd/csv-utils) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

A production-ready TypeScript library for CSV manipulation, featuring robust error handling, strong typing, and a fluent interface. This library provides comprehensive utilities for parsing, transforming, analyzing, and writing CSV data / arrays of objects, with support for operations like header mapping, streaming for large files, schema validation, and async processing.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Examples](#examples)
  - [Basic Operations](#basic-operations)
  - [Custom Type Casting](#custom-type-casting)
  - [Header Mapping](#header-mapping)
    - [Basic Mapping](#basic-mapping)
    - [Reading and Writing with Header Mapping](#reading-and-writing-with-header-mapping)
    - [Array Mapping](#array-mapping)
      - [Mapping Multiple Columns to an Array](#mapping-multiple-columns-to-an-array)
      - [Explicit Column List for Array Mapping](#explicit-column-list-for-array-mapping)
      - [Mapping an Array to Multiple Columns](#mapping-an-array-to-multiple-columns)
  - [Preamble Handling](#preamble-handling)
  - [Schema Validation](#schema-validation)
    - [Using Standard Schema](#using-standard-schema)
    - [Using Zod for Schema Validation](#using-zod-for-schema-validation)
    - [Working with Validation Results](#working-with-validation-results)
  - [Array Transformations](#array-transformations)
  - [Async Processing](#async-processing)
    - [Async File Operations](#async-file-operations)
    - [Async Iteration and Batching](#async-iteration-and-batching)
    - [Async Generators for Large Files](#async-generators-for-large-files)
  - [Error Handling and Retries](#error-handling-and-retries)
  - [Data Analysis and Transformation](#data-analysis-and-transformation)
    - [Merging Datasets](#merging-datasets)
    - [Simple Data Analysis](#simple-data-analysis)
    - [Advanced Transformations (Join, Unpivot, etc.)](#advanced-transformations-join-unpivot-etc)
- [Standalone Functions](#standalone-functions-module)
  - [Quick Start with Standalone Functions](#quick-start-with-standalone-functions)
  - [Functional Composition](#functional-composition)
- [API Documentation](#api-documentation)
  - [Core Class: CSV](#core-class-csv)
    - [Static Methods](#static-methods)
    - [Instance Methods](#instance-methods)
  - [Utility Objects](#utility-objects)
    - [CSVUtils](#csvutils)
    - [CSVArrayUtils](#csvarrayutils)
  - [Generator Functions](#generator-functions)
  - [Key Types and Interfaces](#key-types-and-interfaces)
    - [CSVError](#csverror)
    - [Options Interfaces](#options-interfaces)
    - [Casting Related Types](#casting-related-types)
    - [Schema Related Types](#schema-related-types)
    - [Other Types](#other-types)
- [Memory-Efficient Stream Processing with `CSVStreamProcessor`](#memory-efficient-stream-processing-with-csvstreamprocessor)
  - [Creating a Stream Processor](#creating-a-stream-processor)
  - [Fluent Stream Transformations](#fluent-stream-transformations)
  - [Executing the Stream Pipeline](#executing-the-stream-pipeline)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)

## Features

- **🔒 Type Safety** - Comprehensive TypeScript support with generic types for robust data handling.
- **🧩 Flexible Header Mapping** - Sophisticated transformation between flat CSV columns and nested object structures, including mapping to/from array properties.
- **📊 Rich Data Operations** - Extensive methods for querying, filtering, updating, sorting, grouping, and aggregating data.
- **📈 Advanced Transformations** - Powerful tools for data conversion, including `join`, `pivot`, `unpivot`, `addColumn`, `castColumnType`, and more.
- **⚡ Async & Parallel Processing** - Efficiently handle large files with asynchronous operations, stream processing, and worker thread support for CPU-intensive tasks.
- **🛡️ Robust Error Handling** - Custom `CSVError` class and configurable retry mechanisms for I/O operations.
- **📝 Extensive Preamble Support** - Read, store, and write CSV preambles (additional header lines/comments).
- **🚀 Fluent Interface (Builder Pattern)** - Chain methods for elegant and readable data manipulation pipelines.
- **🧠 Smart Custom Type Casting** - Define custom logic to test and parse string values into specific types (numbers, dates, booleans, custom objects) on a global or per-column basis.
- **🔄 High-Performance Streaming API** - `CSVStreamProcessor` for processing massive CSV files with minimal memory footprint, featuring a fluent API.
- **🔍 Schema Validation** - Integrated support for data validation against `StandardSchemaV1` (compatible with Zod and other validation libraries), with modes for erroring, filtering, or keeping invalid data.
- **⚖️ Memory Efficiency** - Stream processing utilizes a fixed-size circular buffer with automatic backpressure to manage memory usage effectively for very large datasets.
- **📦 Batch Processing** - Optimized methods for processing data in configurable batches for improved throughput in async operations.
- **📦 Standalone Functions** - Alternative functional programming style for all core operations.

## Installation

```bash
npm install @doeixd/csv-utils
# or
yarn add @doeixd/csv-utils
# or
pnpm add @doeixd/csv-utils
```

## Quick Start

For more, check out the [dedicated quick start guide](docs/quick-start.md).

```typescript
import CSV, { CSVUtils } from '@doeixd/csv-utils';

interface Product {
  id: string;
  name: string;
  price: number;
  category: string;
  inventory?: number;
  currency?: string;
}

// Read from a CSV file (assuming price is numeric in CSV or cast later)
const products = CSV.fromFile<Product>('products.csv');

// Chain operations
const result = products
  .findRowsWhere(p => p.price > 100)        // Find expensive products
  .update({ currency: 'USD' })              // Add currency field
  .updateColumn('price', p => p * 0.9)      // Apply 10% discount
  .sortBy('price', 'desc')                  // Sort by price (high to low)
  .removeWhere(p => (p.inventory ?? 0) < 5) // Remove low inventory items
  .toArray();                               // Get the results as an array

// Write back to file
CSVUtils.writeCSV('discounted_products.csv', result);

// Alternatively, write using the CSV instance
// CSV.fromData(result).writeToFile('discounted_products.csv');
```

## Examples

### Basic Operations

```typescript
import CSV from '@doeixd/csv-utils';

interface User {
  id: string;
  name: string;
  role: string;
  department?: string;
  accessLevel?: string;
}

// Create from data
const users = CSV.fromData<User>([
  { id: '1', name: 'Alice', role: 'admin' },
  { id: '2', name: 'Bob', role: 'user' },
  { id: '3', name: 'Charlie', role: 'user' }
]);

// Query operations
const admin = users.findRow('1', 'id');
const allUsers = users.findRowsWhere(user => user.role === 'user');

// Transformation
const withDepartment = users.update({ department: 'IT' });
const updatedUsers = users.updateWhere(
  user => user.role === 'admin',
  { accessLevel: 'full' }
);

// Output as CSV string (by default, includes headers)
const csvString = users.toString();
// console.log(csvString);
// id,name,role
// 1,Alice,admin
// 2,Bob,user
// 3,Charlie,user

// Write to file
users.writeToFile('users.csv');
```

### Custom Type Casting

Apply sophisticated type conversions beyond basic CSV parsing.

```typescript
import CSV, { Caster, CSVReadOptions } from '@doeixd/csv-utils';

interface Order {
  order_id: string;
  discount_code: string | null; // Can be 'N/A' or empty
  tax_rate: number;             // e.g., '7.5%' -> 0.075
  created_at: Date;             // e.g., '12/25/2023' -> Date object
  price: number;                // e.g., '$19.99' or '19.99'
}

// Custom caster for percentages (e.g., '7.5%' -> 0.075)
const percentageCaster: Caster<number> = {
  test: (value) => typeof value === 'string' && value.endsWith('%'),
  parse: (value) => parseFloat(value.replace('%', '')) / 100,
};

// Custom caster for dates (e.g., 'MM/DD/YYYY')
const dateCaster: Caster<Date> = {
  test: (value) => typeof value === 'string' && /^\d{1,2}\/\d{1,2}\/\d{4}$/.test(value),
  parse: (value) => {
    const [month, day, year] = value.split('/').map(Number);
    return new Date(year, month - 1, day); // Month is 0-indexed
  },
};

// Custom caster for potentially null string values
const nullableStringCaster: Caster<string | null> = {
  test: (value) => typeof value === 'string' && (value.toUpperCase() === 'N/A' || value.trim() === ''),
  parse: () => null,
};

const readOptions: CSVReadOptions<Order> = {
  customCasts: {
    definitions: {
      // Globally available casters by key
      number: {
        test: (value) => typeof value === 'string' && !isNaN(parseFloat(value.replace(/[^0-9.-]+/g, ""))),
        parse: (value) => parseFloat(value.replace(/[^0-9.-]+/g, "")),
      },
      date: dateCaster, // Use our custom dateCaster
      nullableString: nullableStringCaster,
    },
    columnCasts: {
      // Column-specific rules
      order_id: 'string',                     // Use built-in string caster (or keep as is if already string)
      discount_code: ['nullableString'],      // Try nullableString caster first
      tax_rate: [percentageCaster, 'number'], // Try percentage, then general number
      created_at: 'date',
      price: [
        // Try multiple specific casters for price
        { // Caster for '$XX.YY' format
          test: (v) => typeof v === 'string' && v.startsWith('$'),
          parse: (v) => parseFloat(v.substring(1)),
        },
        'number', // Fallback to general number caster
      ],
    },
    onCastError: 'error', // 'error' (default), 'null', or 'original'
  },
};

// Assuming 'orders.csv' contains:
// order_id,discount_code,tax_rate,created_at,price
// ORD001,N/A,7.5%,12/25/2023,$19.99
// ORD002,,5%,01/15/2024,25
const orders = CSV.fromFile<Order>('orders.csv', readOptions);
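// Note on onCastError at this step: with 'error' (the default), a value that no
// configured caster accepts makes the read throw (a CSVError, per the
// error-handling section below); 'null' would substitute null for the failed
// cell, and 'original' would keep the raw string untouched.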
const firstOrder = orders.toArray()[0];
console.log(firstOrder.tax_rate);                   // 0.075
console.log(firstOrder.created_at instanceof Date); // true
console.log(firstOrder.price);                      // 19.99
console.log(firstOrder.discount_code);              // null
```

### Header Mapping

Transform CSV column names to/from nested object properties.

#### Basic Mapping

```typescript
import { createHeaderMapFns, HeaderMap } from '@doeixd/csv-utils';

interface User {
  id: string;
  profile: {
    firstName: string;
    lastName: string;
  };
  contact: {
    email: string;
  };
}

// Define a mapping: CSV header -> object path
const headerMap: HeaderMap<User> = {
  'user_id': 'id',
  'first_name': 'profile.firstName',
  'last_name': 'profile.lastName',
  'email_address': 'contact.email',
};

// Create mapping functions
const { fromRowArr, toRowArr } = createHeaderMapFns<User>(headerMap);

// Convert CSV row (object) to structured object
const csvRow = {
  user_id: '123',
  first_name: 'John',
  last_name: 'Doe',
  email_address: 'john@example.com',
};
const userObject = fromRowArr(csvRow);
console.log(userObject.profile.firstName); // John

// Convert structured object back to a flat array for CSV writing
const csvHeaders = ['user_id', 'first_name', 'last_name', 'email_address'];
const flatArray = toRowArr(userObject, csvHeaders);
console.log(flatArray); // ['123', 'John', 'Doe', 'john@example.com']
```

#### Reading and Writing with Header Mapping

```typescript
import CSV, { HeaderMap } from '@doeixd/csv-utils';

interface User {
  id: string;
  profile: {
    firstName: string;
    lastName: string;
  };
}

// --- READING (flat CSV columns -> nested object properties) ---
const inputHeaderMap: HeaderMap<User> = {
  'USER_IDENTIFIER': 'id',
  'GIVEN_NAME': 'profile.firstName',
  'FAMILY_NAME': 'profile.lastName',
};

// Assumes users_input.csv has columns: USER_IDENTIFIER,GIVEN_NAME,FAMILY_NAME
const users = CSV.fromFile<User>('users_input.csv', { headerMap: inputHeaderMap });
console.log(users.toArray()[0].profile.firstName);

// --- WRITING (nested object properties -> flat CSV columns) ---
const outputHeaderMap: HeaderMap<User> = {
  'id': 'UserID', // map 'id' property to 'UserID' CSV column
  'profile.firstName': 'FirstName',
  'profile.lastName': 'LastName',
};

users.writeToFile('users_output.csv', {
  headerMap: outputHeaderMap,
  stringifyOptions: { header: true } // Ensure specified headers are written
});
// users_output.csv will have columns: UserID,FirstName,LastName
```

#### Array Mapping

Map multiple CSV columns to/from an array property in your objects.

##### Mapping Multiple Columns to an Array

```typescript
import CSV, { HeaderMap, CsvToArrayConfig } from '@doeixd/csv-utils';

interface Product {
  id: string;
  name: string;
  imageUrls: string[];
}

// CSV columns 'image_url_1', 'image_url_2', ... map to the 'imageUrls' array
const productHeaderMap: HeaderMap<Product> = {
  'product_sku': 'id',
  'product_name': 'name',
  // This special key (e.g., '_imageMappingConfig') is a config, not a CSV column.
  '_imageMappingConfig': {
    _type: 'csvToTargetArray',
    targetPath: 'imageUrls', // Property in Product interface
    sourceCsvColumnPattern: /^image_url_(\d+)$/, // Matches 'image_url_1', 'image_url_2', etc.
    // Optional: sort columns before adding to array (e.g., by the number in pattern)
    sortSourceColumnsBy: (match) => parseInt(match[1], 10),
    // Optional: transform each value before adding to array
    transformValue: (value) => (value ? `https://cdn.example.com/${value}` : null),
    // Optional: filter out null/empty values after transformation
    filterEmptyValues: true,
  } as CsvToArrayConfig,
};

// Assuming products_images.csv:
// product_sku,product_name,image_url_2,image_url_1
// SKU001,Awesome Gadget,gadget_thumb.jpg,gadget_main.jpg
const products = CSV.fromFile<Product>('products_images.csv', { headerMap: productHeaderMap });
// products.toArray()[0].imageUrls will be
// ['https://cdn.example.com/gadget_main.jpg', 'https://cdn.example.com/gadget_thumb.jpg']
```

##### Explicit Column List for Array Mapping

```typescript
// If CSV columns don't follow a pattern, list them explicitly:
const explicitImageMap: HeaderMap<Product> = {
  'product_sku': 'id',
  'product_name': 'name',
  '_imageMappingConfig': {
    _type: 'csvToTargetArray',
    targetPath: 'imageUrls',
    sourceCsvColumns: ['mainProductImage', 'thumbnailImage', 'galleryImage3'],
  } as CsvToArrayConfig,
};
```

##### Mapping an Array to Multiple Columns

```typescript
import CSV, { HeaderMap, ObjectArrayToCsvConfig } from '@doeixd/csv-utils';

// (Product interface is same as above)
const productsData: Product[] = [
  { id: 'SKU002', name: 'Another Item', imageUrls: ['item_front.png', 'item_back.png'] }
];

// Map 'imageUrls' array back to CSV columns 'image_col_0', 'image_col_1', ...
const writeProductHeaderMap: HeaderMap<Product> = {
  'id': 'product_sku',
  'name': 'product_name',
  'imageUrls': { // Key must match the array property name in Product
    _type: 'targetArrayToCsv',
    targetCsvColumnPrefix: 'image_col_', // Output columns: image_col_0, image_col_1, ...
    maxColumns: 3,       // Create up to 3 image columns
    emptyCellOutput: '', // Value for empty cells if array is shorter than maxColumns
    // Optional: transform value before writing
    transformValue: (value) => value.replace('https://cdn.example.com/', ''),
  } as ObjectArrayToCsvConfig,
};

CSV.fromData(productsData).writeToFile('products_output_arrays.csv', {
  headerMap: writeProductHeaderMap,
  stringifyOptions: { header: true }
});
// products_output_arrays.csv might have:
// product_sku,product_name,image_col_0,image_col_1,image_col_2
// SKU002,Another Item,item_front.png,item_back.png,""
```

### Preamble Handling

Manage metadata or comments at the beginning of CSV files.

```typescript
import CSV from '@doeixd/csv-utils';

// Example CSV file (data_with_preamble.csv):
// # File Generated: 2024-01-01
// # Source: SystemX
// id,name,value
// 1,Alpha,100
// 2,Beta,200

// --- Reading with Preamble ---
const csvInstance = CSV.fromFile('data_with_preamble.csv', {
  saveAdditionalHeader: true, // Enable preamble capture
  csvOptions: {
    from_line: 3, // Actual data starts on line 3
    comment: '#', // Treat lines starting with # as comments (part of preamble if before from_line)
  },
  // Optional: dedicated parsing options for the preamble itself
  additionalHeaderParseOptions: {
    delimiter: ',', // If preamble has a different structure
    // Note: options like 'columns', 'from_line', 'to_line' are overridden for preamble.
  }
});

console.log('Preamble:\n', csvInstance.additionalHeader);
// Preamble:
// # File Generated: 2024-01-01
// # Source: SystemX

console.log('Data:', csvInstance.toArray());
// Data: [ { id: '1', name: 'Alpha', value: '100' }, { id: '2', name: 'Beta', value: '200' } ]

// --- Writing with Preamble ---
const preambleContent = `# Exported: ${new Date().toISOString()}\n# User: admin\n`;
csvInstance.writeToFile('output_with_preamble.csv', {
  additionalHeader: preambleContent,
});

// To preserve an existing preamble when modifying and saving:
const modifiedCsv = csvInstance.updateColumn('value', v => parseInt(v) * 2);
modifiedCsv.writeToFile('modified_output.csv', {
  additionalHeader: csvInstance.additionalHeader // Use the original preamble
});
```

**Note on `saveAdditionalHeader`:**

- If `number > 0`: Specifies the exact number of lines to extract as the preamble. Data parsing will start after these lines, unless `csvOptions.from_line` points to an even later line.
- If `true`: Enables preamble extraction *if* `csvOptions.from_line` is set to a value greater than 1. The preamble will consist of `csvOptions.from_line - 1` lines.
- If `false`, `0`, or `undefined`: No preamble is extracted.

### Schema Validation

Validate CSV data against predefined schemas.

#### Using Standard Schema

This library supports `StandardSchemaV1` for defining custom validation logic.

```typescript
import CSV, { StandardSchemaV1, CSVSchemaConfig } from '@doeixd/csv-utils';

interface User {
  id: number;
  email: string;
  age?: number;
}

// Custom schema for validating email strings
const emailFormatSchema: StandardSchemaV1<string, string> = {
  '~standard': {
    version: 1,
    vendor: 'csv-utils-example',
    validate: (value: unknown): StandardSchemaV1.Result<string> => {
      if (typeof value !== 'string') return { issues: [{ message: 'Must be a string' }] };
      if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) return { issues: [{ message: 'Invalid email format' }] };
      return { value };
    },
    types: { input: '' as string, output: '' as string }
  }
};

const userSchemaConfig: CSVSchemaConfig<User> = {
  columnSchemas: {
    id: { // Ensure ID is a positive number (example with simple validation)
      '~standard': {
        version: 1,
        vendor: 'csv-utils-example',
        validate: (v: unknown) => {
          const n = Number(v);
          if (isNaN(n) || n <= 0) return { issues: [{ message: "ID must be a positive number" }] };
          return { value: n };
        },
        types: { input: undefined as any, output: 0 as number }
      }
    },
    email: emailFormatSchema,
  },
  validationMode: 'filter', // 'error', 'filter', or 'keep'
  // useAsync: false // Default, set to true for async validation logic within schemas
};

// Assuming users_for_validation.csv:
// id,email,age
// 1,alice@example.com,30
// two,bob-invalid-email,25
// 3,carol@example.com,
const users = CSV.fromFile<User>('users_for_validation.csv', { schema: userSchemaConfig });
// 'users' will only contain valid rows due to 'filter' mode.
// { id: 1, email: 'alice@example.com', age: '30' } // age is still a string from the parser
// { id: 3, email: 'carol@example.com', age: '' }
```

#### Using Zod for Schema Validation

Requires `zod` to be installed (`npm install zod`).
```typescript
import CSV, { CSVSchemaConfig } from '@doeixd/csv-utils';
import { z } from 'zod';

const zodUserSchema = z.object({
  id: z.string().min(1, "ID is required"),
  name: z.string().min(2, "Name must be at least 2 characters"),
  email: z.string().email("Invalid email address"),
  age: z.number().positive("Age must be a positive number").optional(),
});

type ZodUser = z.infer<typeof zodUserSchema>;

const csvWithZodSchema: CSVSchemaConfig<ZodUser> = {
  rowSchema: zodUserSchema, // Apply to the whole row after initial parsing & custom casting
  // columnSchemas: { // Can also define Zod schemas for individual columns for pre-rowSchema validation
  //   age: z.coerce.number().positive().optional() // Coerce age to number before row validation
  // },
  validationMode: 'filter',
  // useAsync: true // If any Zod schema uses async refinements
};

// Example: use customCasts to convert 'age' before Zod validation
const usersZod = CSV.fromFile<ZodUser>('users_data.csv', {
  customCasts: {
    // Convert age string to number before Zod validation
    columnCasts: { age: 'number' },
    definitions: { number: { test: v => !isNaN(parseFloat(v)), parse: v => parseFloat(v) } }
  },
  schema: csvWithZodSchema
});
```

#### Working with Validation Results

If `validationMode: 'keep'` is used, results are available on the `CSV` instance.

```typescript
const configKeep: CSVSchemaConfig<User> = { /* ... */ validationMode: 'keep' };
const usersResult = CSV.fromFile<User>('users.csv', { schema: configKeep });

if (usersResult.validationResults) {
  usersResult.validationResults.forEach(res => {
    if (!res.valid) {
      console.log(`Invalid row: ${JSON.stringify(res.originalRow)}`);
      if (res.rowIssues) console.log('  Row issues:', res.rowIssues.map(i => i.message));
      if (res.columnIssues) {
        Object.entries(res.columnIssues).forEach(([col, issues]) => {
          console.log(`  Column '${col}' issues:`, issues.map(i => i.message));
        });
      }
    }
  });
}
```

### Array Transformations

Utilities for converting between arrays of arrays and arrays of objects.

```typescript
import { CSVArrayUtils, HeaderMap } from '@doeixd/csv-utils';

interface ProductRecord {
  id: string;
  productName: string;
  unitPrice: number;
  category: string;
}

// --- Array of Arrays -> Array of Objects ---
const csvDataAsArrays = [
  ['SKU', 'Item Name', 'Price', 'Type'], // Header
  ['A123', 'Super Widget', '19.99', 'Gadgets'],
  ['B456', 'Mega Thinger', '29.50', 'Gizmos'],
];

const productMap: HeaderMap<ProductRecord> = {
  0: 'id', // Index 0 maps to 'id'
  1: 'productName',
  2: 'unitPrice', // This will be a string initially from CSV
  3: 'category',
};

const productsArray = CSVArrayUtils.arrayToObjArray<ProductRecord>(
  csvDataAsArrays.slice(1), // Data rows
  productMap
);
// productsArray[0] = { id: 'A123', productName: 'Super Widget', unitPrice: '19.99', category: 'Gadgets' }
// Note: For type conversion (e.g., string '19.99' to number), use CSV class with customCasts or schema validation.
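// A minimal sketch of that conversion (hedged: not part of the original example,
// and it assumes `import CSV from '@doeixd/csv-utils'` is added at the top),
// using the castColumnType method documented in the API tables below:
// const typedProducts = CSV.fromData(productsArray).castColumnType('unitPrice', 'number');
// typedProducts.toArray()[0].unitPrice // 19.99 as a number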
// --- Array of Objects -> Array of Arrays ---
const productObjects: ProductRecord[] = [
  { id: 'C789', productName: 'Hyper Spanner', unitPrice: 9.95, category: 'Tools' },
];

// Map object properties back to array indices/CSV headers
const outputMapConfig: HeaderMap = {
  // Here, keys are object paths, values are CSV headers or indices
  'id': 'Product ID',
  'productName': 'Name',
  'unitPrice': 'Cost',
  'category': 'Department',
};

const outputHeaders = ['Product ID', 'Name', 'Cost', 'Department'];
const arraysForCsv = CSVArrayUtils.objArrayToArray<ProductRecord>(
  productObjects,
  outputMapConfig,
  outputHeaders,
  true // Include headers as the first row
);
// arraysForCsv = [
//   ['Product ID', 'Name', 'Cost', 'Department'],
//   ['C789', 'Hyper Spanner', 9.95, 'Tools']
// ]

// --- Grouping ---
const groupedByCategory = CSVArrayUtils.groupByField(productsArray, 'category');
// groupedByCategory['Gadgets'] would be an array of products in that category.
```

### Async Processing

Handle large datasets and I/O-bound operations efficiently.

#### Async File Operations

```typescript
import CSV from '@doeixd/csv-utils';

interface MyData { /* ... */ }

// Asynchronously read from a file (loads all data into memory after parsing)
async function loadDataAsync() {
  const csvData = await CSV.fromFileAsync<MyData>('large_dataset.csv', {
    // CSVReadOptions apply here, e.g., headerMap, customCasts, schema
  });
  console.log(`Loaded ${csvData.count()} records.`);
  return csvData;
}

// Asynchronously write to a file
async function saveDataAsync(csvInstance: CSV<MyData>) {
  await csvInstance.writeToFileAsync('output_dataset.csv');
  console.log('Data written asynchronously.');
}
```

#### Async Iteration and Batching

```typescript
async function processDataInBatches(csvInstance: CSV<MyData>) {
  // Process each row with an async callback
  await csvInstance.forEachAsync(async (row, index) => {
    // await someAsyncDbUpdate(row);
    console.log(`Processed row ${index + 1} asynchronously.`);
  }, { batchSize: 100, batchConcurrency: 5 }); // 100 items per batch, 5 batches concurrently

  // Transform data with an async mapping function
  const enrichedData = await csvInstance.mapAsync(async (row) => {
    // const details = await fetchExtraDetails(row.id);
    // return { ...row, ...details };
    return row; // Placeholder
  }, { batchSize: 50, batchConcurrency: 10 });
  console.log(`Enriched ${enrichedData.length} records.`);
}
```

#### Async Generators for Large Files

Ideal for memory-efficient processing of very large files.

```typescript
import { csvGenerator, csvBatchGenerator, writeCSVFromGenerator, CSVStreamOptions } from '@doeixd/csv-utils';

interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
}

const streamOptions: CSVStreamOptions<LogEntry> = {
  csvOptions: { columns: true, trim: true },
  // headerMap: { /* ... */ },                              // Optional header mapping
  // transform: (row) => ({ ...row, parsedAt: new Date() }) // Optional row transformation
};

async function analyzeLogs() {
  // Process row by row
  let errorCount = 0;
  for await (const log of csvGenerator<LogEntry>('application.log', streamOptions)) {
    if (log.level === 'ERROR') errorCount++;
  }
  console.log(`Total error logs: ${errorCount}`);

  // Process in batches
  for await (const batch of csvBatchGenerator<LogEntry>('application.log', { ...streamOptions, batchSize: 1000 })) {
    // await bulkInsertToDb(batch);
    console.log(`Processed batch of ${batch.length} logs.`);
  }
}

// Example: Transform and write using generators
async function transformAndWriteLogs() {
  async function* transformedLogGenerator() {
    for await (const log of csvGenerator<LogEntry>('input.log')) {
      if (log.level === 'INFO') { // Filter and transform
        yield { ...log, message: log.message.toUpperCase() } as LogEntry;
      }
    }
  }
  await writeCSVFromGenerator('output_info_logs.csv', transformedLogGenerator());
}
```

### Error Handling and Retries

```typescript
import CSV, { CSVError } from '@doeixd/csv-utils';

try {
  const data = CSV.fromFile('potentially_flaky_network_file.csv', {
    retry: {
      maxRetries: 3,    // Attempt up to 3 times after initial failure
      baseDelay: 500,   // Initial delay 500ms, then 1000ms, 2000ms (exponential backoff)
      logRetries: true, // Log retry attempts to console.warn
    }
  });
  // ... process data
} catch (error) {
  if (error instanceof CSVError) {
    console.error(`CSV operation failed: ${error.message}`);
    if (error.cause) {
      console.error('Underlying cause:', error.cause);
    }
  } else {
    console.error('An unexpected error occurred:', error);
  }
}
```

### Data Analysis and Transformation

#### Merging Datasets

```typescript
import CSV from '@doeixd/csv-utils';

interface InventoryItem {
  sku: string;
  name: string;
  price: number;
  stock: number;
}

interface SalesDataItem {
  sku: string;
  unitsSold: number;
}

const inventory = CSV.fromData<InventoryItem>([
  { sku: 'A1', name: 'Apple', price: 1.0, stock: 100 },
  { sku: 'B2', name: 'Banana', price: 0.5, stock: 150 },
]);

const sales = CSV.fromData<SalesDataItem>([
  { sku: 'A1', unitsSold: 10 },
  { sku: 'C3', unitsSold: 5 }, // This SKU not in inventory
]);

// Merge sales data into inventory, updating stock
const updatedInventory = inventory.mergeWith(
  sales,
  (invItem, saleItem) => invItem.sku === saleItem.sku, // Equality condition
  (invItem, saleItem) => ({ // Merge function for matched items
    ...invItem,
    stock: invItem.stock - saleItem.unitsSold,
  })
);
// updatedInventory will have Banana unchanged, Apple with reduced stock.
// Items only in 'sales' are not included by default with this merge logic.
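
// A hedged follow-up sketch (not part of the original example): to surface the
// sales rows that matched nothing in inventory, diff them out with the
// documented toArray/findRowsWhere methods.
const knownSkus = new Set(inventory.toArray().map(item => item.sku));
const unmatchedSales = sales.findRowsWhere(sale => !knownSkus.has(sale.sku));
// unmatchedSales = [ { sku: 'C3', unitsSold: 5 } ]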
```

#### Simple Data Analysis

```typescript
import CSV from '@doeixd/csv-utils';

interface Sale {
  product: string;
  region: string;
  amount: number;
  month: string;
}

const salesData = CSV.fromData<Sale>([
  { product: 'Laptop', region: 'North', amount: 1200, month: 'Jan' },
  { product: 'Mouse', region: 'North', amount: 25, month: 'Jan' },
  { product: 'Laptop', region: 'South', amount: 1500, month: 'Feb' },
  { product: 'Keyboard', region: 'North', amount: 75, month: 'Jan' },
]);

const totalRevenue = salesData.aggregate('amount', 'sum'); // Sum of 'amount'
const averageSale = salesData.aggregate('amount', 'avg');
const uniqueRegions = salesData.distinct('region'); // ['North', 'South']

// Pivot table: product sales by region
const salesPivot = salesData.pivot('product', 'region', 'amount');
// salesPivot = {
//   Laptop: { North: 1200, South: 1500 },
//   Mouse: { North: 25 },
//   Keyboard: { North: 75 }
// }
```

#### Advanced Transformations (Join, Unpivot, etc.)

```typescript
import CSV from '@doeixd/csv-utils';

// --- Join Example ---
interface User { id: number; name: string; cityId: number; }
interface City { cityId: number; cityName: string; }

const users = CSV.fromData<User>([
  { id: 1, name: 'Alice', cityId: 101 },
  { id: 2, name: 'Bob', cityId: 102 }
]);
const cities = CSV.fromData<City>([
  { cityId: 101, cityName: 'New York' },
  { cityId: 103, cityName: 'Paris' }
]);

const usersWithCities = users.join(
  cities,
  { left: 'cityId', right: 'cityId', type: 'left' }, // Left join on cityId
  (user, city) => ({ // Custom select function for the result
    userId: user!.id,
    userName: user!.name,
    cityName: city ? city.cityName : 'Unknown',
  })
);
// usersWithCities.toArray() would include Alice with New York, Bob with Unknown city.

// --- Unpivot Example ---
interface QuarterlySales { product: string; q1: number; q2: number; }
const wideSales = CSV.fromData<QuarterlySales>([
  { product: 'Gadget', q1: 100, q2: 150 }
]);

const longSales = wideSales.unpivot(
  ['product'],  // ID columns to repeat
  ['q1', 'q2'], // Value columns to unpivot
  'quarter',    // Name for the new 'variable' column
  'sales'       // Name for the new 'value' column
);
// longSales.toArray() = [
//   { product: 'Gadget', quarter: 'q1', sales: 100 },
//   { product: 'Gadget', quarter: 'q2', sales: 150 }
// ]

// Other useful transformations:
const sampleData = CSV.fromData([{ a: 1, b: " x " }, { a: 2, b: " y " }]);
const cleanedData = sampleData
  .addColumn('c', row => row.a * 2)   // Add new column 'c'
  .renameColumn('a', 'alpha')         // Rename 'a' to 'alpha'
  .castColumnType('alpha', 'string')  // Cast 'alpha' to string
  .normalizeText('b', 'uppercase')    // Uppercase column 'b'
  .trimWhitespace(['b'])              // Trim whitespace from 'b'
  .fillMissingValues('alpha', 'N/A'); // Fill missing in 'alpha' (if any)
```

## Standalone Functions Module

For a more functional programming style, standalone functions are available. They operate on arrays of objects and return new arrays or values, mirroring the `CSV` class methods.
### Quick Start with Standalone Functions

```typescript
import { findRowsWhere, updateColumn, sortBy, aggregate } from '@doeixd/csv-utils/standalone';
// Or import all as a namespace: import csvFn from '@doeixd/csv-utils/standalone';

interface Product {
  id: string;
  name: string;
  price: number;
  category: string;
}

const products: Product[] = [
  { id: 'P001', name: 'Laptop', price: 899.99, category: 'Electronics' },
  { id: 'P002', name: 'Headphones', price: 149.99, category: 'Electronics' },
  { id: 'P003', name: 'T-shirt', price: 19.99, category: 'Clothing' },
];

// Find expensive electronics
const expensiveElectronics = findRowsWhere(
  products,
  p => p.category === 'Electronics' && p.price > 500
);

// Apply discount to all products
const discounted = updateColumn(products, 'price', (price: number) => price * 0.9);

// Sort products by price (descending)
const sortedByPrice = sortBy(products, 'price', 'desc');

// Get max price
const maxPrice = aggregate(products, 'price', 'max'); // csvFn.aggregate(...)
```

### Functional Composition

Standalone functions are well-suited for composition libraries like `fp-ts`.

```typescript
import { pipe } from 'fp-ts/function'; // Example with fp-ts
import { findRowsWhere, updateColumn, sortBy } from '@doeixd/csv-utils/standalone';

// (products array defined as above)
const processProducts = (data: Product[]) =>
  pipe(
    data,
    d => findRowsWhere(d, p => p.category === 'Electronics'),
    d => updateColumn(d, 'price', (price: number) => price * 0.9),
    d => sortBy(d, 'price', 'asc')
  );

const processed = processProducts(products);
```

## API Documentation

### Core Class: CSV

The central class for CSV manipulation with a fluent interface.

#### Static Methods

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `fromFile<T>(filename, options?)` | Creates a CSV instance from a file path. | `CSV<T>` |
| `fromData<T>(data)` | Creates a CSV instance from an array of objects. | `CSV<T>` |
| `fromString<T>(csvString, options?)` | Creates a CSV instance from a CSV content string. | `CSV<T>` |
| `fromStream<T>(stream, options?)` | Creates a CSV instance from a NodeJS Readable stream. | `Promise<CSV<T>>` |
| `fromFileAsync<T>(filename, options?)` | Asynchronously creates a CSV instance from a file path using streams. | `Promise<CSV<T>>` |
| `streamFromFile<SourceRowType>(filename, options?)` | Creates a `CSVStreamProcessor` for fluent, memory-efficient stream operations. | `CSVStreamProcessor<SourceRowType, SourceRowType>` |

_`options` for read methods are typically `CSVReadOptions<T>`._

#### Instance Methods

##### Data Retrieval & Output

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `toArray()` | Returns the internal data as a new array of objects. | `T[]` |
| `toString(options?: CsvStringifyOptions<T>)` | Converts the data to a CSV string. Supports `headerMap` via options. | `string` |
| `count()` | Returns the number of rows. | `number` |
| `getBaseRow(defaults?)` | Creates a template object based on the CSV's column structure. | `Partial<T>` |
| `createRow(data?)` | Creates a new row object conforming to the CSV's structure. | `T` |
| `writeToFile(filename, options?)` | Writes the CSV data to a file. | `void` |
| `writeToFileAsync(filename, options?)` | Asynchronously writes the CSV data to a file. | `Promise<void>` |

##### Validation

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `validate<U = T>(schema)` | Validates data synchronously against a schema. Throws on async schema. | `CSV<U>` |
| `validateAsync<U = T>(schema)` | Validates data asynchronously against a schema. | `Promise<CSV<U>>` |
| `validationResults` (readonly property) | Array of `RowValidationResult<T>` if schema validation used 'keep' mode. | `RowValidationResult<T>[] \| undefined` |

##### Query Methods

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `findRow(value, column?)` | Finds the first row where `column` strictly matches `value`. | `T \| undefined` |
| `findRowByRegex(regex, column?)` | Finds the first row where `column` matches `regex`. | `T \| undefined` |
| `findRows(value, column?)` | Finds all rows where `column` (as string) includes `value` (as string). | `T[]` |
| `findRowWhere(predicate)` | Finds the first row matching the `predicate` function. | `T \| undefined` |
| `findRowsWhere(predicate)` | Finds all rows matching the `predicate` function. | `T[]` |
| `findSimilarRows(str, column)` | Finds rows with string similarity to `str` in `column`, sorted by distance. | `SimilarityMatch<T>[]` |
| `findMostSimilarRow(str, column)` | Finds the most similar row to `str` in `column`. | `SimilarityMatch<T> \| undefined` |

##### Transformation Methods

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `update(modifications)` | Updates all rows. `modifications` can be an object or a function. | `CSV<T>` |
| `updateWhere(condition, modifications)` | Updates rows matching `condition`. | `CSV<T>` |
| `updateColumn(column, valueOrFn)` | Updates a specific `column` in all rows. | `CSV<T>` |
| `transform<R>(transformer)` | Transforms each row into a new structure `R`. | `CSV<R>` |
| `removeWhere(condition)` | Removes rows matching `condition`. | `CSV<T>` |
| `append(...rows)` | Adds new `rows` to the dataset. | `CSV<T>` |
| `mergeWith<E>(other, equalityFn, mergeFn)` | Merges with another dataset `other` (array or `CSV<E>`). | `CSV<T>` |
| `addColumn<NK, NV>(colName, valOrFn)` | Adds a new column `colName` of type `NK` with values of type `NV`. | `CSV<T & Record<NK, NV>>` |
| `removeColumn<K>(colNames)` | Removes one or more `colNames`. | `CSV<Omit<T, K>>` |
| `renameColumn<OK, NK>(oldName, newName)` | Renames `oldName` (type `OK`) to `newName` (type `NK`). | `CSV<Omit<T, OK> & Record<NK, T[OK]>>` |
| `reorderColumns(orderedNames)` | Reorders columns based on `orderedNames`. | `CSV<T>` |
| `castColumnType(colName, targetType)` | Casts `colName` to `targetType` ('string', 'number', 'boolean', 'date'). | `CSV<T>` (underlying data type changes) |
| `deduplicate(colsToCheck?)` | Removes duplicate rows, optionally checking specific `colsToCheck`. | `CSV<T>` |
| `split(condition)` | Splits data into two `CSV` instances (`pass`, `fail`) based on `condition`. | `{ pass: CSV<T>; fail: CSV<T> }` |
| `join<O, J>(otherCsv, onConfig, selectFn?)` | Joins with `otherCsv` (`CSV<O>`) based on `onConfig`, produces `CSV<J>`. | `CSV<J>` |
| `unpivot<I, V, VN, VLN>(idCols, valCols, varN?, valN?)` | Transforms data from wide to long format. | `CSV<...>` (new long-format row structure) |
| `fillMissingValues<K>(colName, valOrFn)` | Fills `null`/`undefined` in `colName`. | `CSV<T>` |
| `normalizeText<K>(colName, normType)` | Normalizes text case in `colName` (`lowercase`, `uppercase`, `capitalize`). | `CSV<T>` |
| `trimWhitespace(columns?)` | Trims whitespace from string values in specified (or all) `columns`. | `CSV<T>` |

##### Analysis & Sampling Methods

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `groupBy(column)` | Groups rows by values in `column`. | `Record<string, T[]>` |
| `sortBy<K>(column, direction?)` | Sorts rows by `column`. | `CSV<T>` |
| `sortByAsync<K>(column, direction?)` | Asynchronously sorts rows, potentially using worker threads. | `Promise<CSV<T>>` |
| `aggregate<K>(column, operation?)` | Calculates 'sum', 'avg', 'min', 'max', 'count' for `column`. | `number` |
| `distinct<K>(column)` | Gets unique values from `column`. | `Array<T[K]>` |
| `pivot(rowCol, colCol, valCol)` | Creates a pivot table. | `Record<string, Record<string, unknown>>` |
| `sample(count?)` | Gets `count` random rows. | `CSV<T>` |
| `head(count?)` / `take(count?)` | Gets the first `count` rows. | `CSV<T>` |
| `tail(count?)` | Gets the last `count` rows. | `CSV<T>` |

##### Iteration Methods

| Method | Description | Return Type |
| :--- | :--- | :--- |
| `forEach(callback)` | Executes `callback` for each row. | `void` |
| `forEachAsync(callback, options?)` | Asynchronously executes `callback` for each row, with batching. | `Promise<void>` |
| `map<R>(callback)` | Creates a new array by applying `callback` to each row. | `R[]` |
| `mapAsync<R>(callback, options?)` | Asynchronously creates a new array, with batching. | `Promise<R[]>` |
| `reduce<R>(callback, initialValue)` | Reduces rows to a single value. | `R` |
| `reduceAsync<R>(callback, initialValue, options?)` | Asynchronously reduces rows, with optimized batching/parallel strategies. | `Promise<R>` |

### Utility Objects

#### CSVUtils

Standalone utility functions.

| Function | Description |
| :--- | :--- |
| `mergeRows(arrA, arrB, eqFn, mergeFn)` | Merges two arrays of objects based on custom logic. |
| `clone(obj)` | Deep clones an object (using `JSON.parse(JSON.stringify(obj))`). |
| `isValidCSV(str)` | Performs a quick check if a string seems to be valid CSV. |
| `writeCSV(filename, data, options?)` | Writes an array of objects `data` to a CSV `filename`. |
| `writeCSVAsync(filename, data, options?)` | Asynchronously writes `data` to `filename`. |
| `createTransformer<T, R>(transformFn)` | Creates a NodeJS `Transform` stream for row-by-row transformation. |
| `processInWorker<T, R>(operation, data)` | Executes a serializable `operation` with `data` in a worker thread. |
| `processInParallel<T, R>(items, op, opts?)` | Processes `items` in parallel using worker threads. Not for order-dependent ops like sort. |

#### CSVArrayUtils

Utilities for converting between arrays and objects, often used with header maps.

| Function | Description |
| :--- | :--- |
| `arrayToObjArray<T>(data, headerMap, headerRow?)` | Transforms an array of arrays/objects `data` to an array of `T` objects using `headerMap`. |
| `objArrayToArray<T>(data, headerMap, headers?, includeHeaders?)` | Transforms an array of `T` objects `data` to an array of arrays using `headerMap`. |
| `groupByField<T>(data, field)` | Groups an array of `T` objects `data` by the value of `field` (can be a dot-path). |

### Generator Functions

For memory-efficient processing of large CSV files.

| Function | Description |
| :--- | :--- |
| `csvGenerator<T>(filename, options?)` | Asynchronously yields rows of type `T` one by one from `filename`. |
| `csvBatchGenerator<T>(filename, options?)` | Asynchronously yields batches (arrays of `T`) from `filename`. |
| `writeCSVFromGenerator<T>(filename, generator, options?)` | Writes data from an async `generator` of `T` rows to `filename`. |

_`options` for generator functions are `CSVStreamOptions<T>`._

### Key Types and Interfaces

#### CSVError

Custom error class for all library-specific errors.

- `message: string` - Error description.
- `cause?: unknown` - The original error, if any, that led to this `CSVError`.

#### Options Interfaces

- **`CSVReadOptions<T>`**: Configures CSV reading operations.
  - `fsOptions?`: NodeJS file system options.
  - `csvOptions?`: Options for `csv-parse` (e.g., `delimiter`, `quote`, `skip_empty_lines`). Default: `{ columns: true }`.
  - `transform?: (content: string) => string`: Pre-parsing transform for raw file content.
  - `headerMap?: HeaderMap<T>`: Configuration for mapping CSV columns to object properties (see [Header Mapping](#header-mapping)).
  - `retry?: RetryOptions`: Configuration for retrying failed read operations.
  - `validateData?: boolean`: Basic structural validation of parsed data.
  - `schema?: CSVSchemaConfig<T>`: Configuration for data validation against schemas (see [Schema Validation](#schema-validation)).
  - `saveAdditionalHeader?: boolean | number`: Extracts initial lines as a preamble (see [Preamble Handling](#preamble-handling)).
  - `additionalHeaderParseOptions?`: `csv-parse` options specifically for parsing the preamble.
  - `customCasts?`: Configuration for advanced type casting (see [Custom Type Casting](#custom-type-casting)).
    - `definitions?: CustomCastDefinition`: Global named casters.
    - `columnCasts?: ColumnCastConfig<T>`: Per-column casting rules.
    - `onCastError?: 'error' | 'null' | 'original'`: Behavior on casting failure.
- **`CSVWriteOptions<T>`**: Configures CSV writing operations.
  - `additionalHeader?: string`: Preamble content written before the CSV output (see [Preamble Handling](#preamble-handling)).