# OCSV - Odin CSV Parser

A high-performance, RFC 4180 compliant CSV parser written in Odin with Bun FFI support.

[![Release](https://github.com/dvrd/ocsv/actions/workflows/release.yml/badge.svg)](https://github.com/dvrd/ocsv/releases)
[![npm version](https://badge.fury.io/js/ocsv.svg)](https://www.npmjs.com/package/ocsv)
[![CI](https://github.com/dvrd/ocsv/actions/workflows/ci.yml/badge.svg)](https://github.com/dvrd/ocsv/actions)
[![Tests](https://img.shields.io/badge/tests-201%20passing-brightgreen)]()
[![Memory Leaks](https://img.shields.io/badge/memory%20leaks-0-brightgreen)]()
[![RFC 4180](https://img.shields.io/badge/RFC%204180-compliant-blue)]()

**Platform Support:** [![macOS](https://img.shields.io/badge/macOS-ARM64-blue)]()

## Features

- ⚡ **High Performance** - Fast CSV parsing with SIMD optimizations
- 🦺 **Memory Safe** - Zero memory leaks, comprehensive testing
- ✅ **RFC 4180 Compliant** - Full CSV specification support
- 🌍 **UTF-8 Support** - Correct handling of international characters
- 🔧 **Flexible Configuration** - Custom delimiters, quotes, comments
- 📦 **Bun Native** - Direct FFI integration with the Bun runtime
- 🛡️ **Error Handling** - Detailed error messages with line/column info
- 🎯 **Schema Validation** - Type checking, constraints, type conversion
- 🌊 **Streaming API** - Memory-efficient chunk-based processing
- 🔄 **Transform System** - Built-in transforms and pipelines
- 🔌 **Plugin System** - Extensible architecture for custom functionality

## Why Odin + Bun?

**Key Advantages:**

- ✅ Simple build system (no node-gyp, no Python)
- ✅ Better memory safety (explicit memory management + `defer`)
- ✅ Better error handling (enums + multiple returns)
- ✅ No C++ wrapper needed (Bun FFI is direct; see the sketch below)
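The last point is easiest to see in code: Bun can `dlopen` the compiled `libocsv.dylib` directly. The sketch below is illustrative only — the symbol names and signatures are assumptions for the example, not OCSV's actual exports (those live in `src/ffi_bindings.odin`); in practice you would use the published `ocsv` package instead.

```typescript
import { dlopen, FFIType, ptr } from 'bun:ffi';

// Illustrative only: the symbol names/signatures below are assumptions,
// not OCSV's real exports (see src/ffi_bindings.odin for those).
const lib = dlopen('./libocsv.dylib', {
  ocsv_parser_create:  { args: [], returns: FFIType.ptr },
  ocsv_parse:          { args: [FFIType.ptr, FFIType.ptr, FFIType.u64], returns: FFIType.bool },
  ocsv_parser_destroy: { args: [FFIType.ptr], returns: FFIType.void },
});

const parser = lib.symbols.ocsv_parser_create();
const data = Buffer.from('a,b,c\n1,2,3\n');
try {
  // One direct call into the shared library: no node-gyp, no C++ shim
  const ok = lib.symbols.ocsv_parse(parser, ptr(data), data.byteLength);
  console.log('parsed:', ok);
} finally {
  lib.symbols.ocsv_parser_destroy(parser);
}
```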
## Quick Start

### npm Installation (Recommended)

Install OCSV as an npm package for easy integration with your Bun projects:

```bash
# Using Bun
bun add ocsv

# Using npm
npm install ocsv
```

Then use it in your project:

```typescript
import { parseCSV, parseCSVFile } from 'ocsv';

// Parse a CSV string
const result = parseCSV('name,age\nJohn,30\nJane,25', { hasHeader: true });
console.log(result.headers); // ['name', 'age']
console.log(result.rows);    // [['John', '30'], ['Jane', '25']]

// Parse a CSV file
const data = await parseCSVFile('./data.csv', { hasHeader: true });
console.log(`Parsed ${data.rowCount} rows`);
```

### Manual Installation (Development)

For building from source or contributing:

```bash
git clone https://github.com/dvrd/ocsv.git
cd ocsv
```

### Build

**Current Support:** macOS ARM64 (cross-platform support in progress)

```bash
# Using Task (recommended)
task build      # Build release library
task build-dev  # Build debug library
task test       # Run all tests
task info       # Show platform info

# Manual build
odin build src -build-mode:shared -out:libocsv.dylib -o:speed
```

### Basic Usage (Odin)

```odin
package main

import "core:fmt"
import ocsv "src"

main :: proc() {
	// Create parser
	parser := ocsv.parser_create()
	defer ocsv.parser_destroy(parser)

	// Parse CSV data
	csv_data := "name,age,city\nAlice,30,NYC\nBob,25,SF\n"
	ok := ocsv.parse_csv(parser, csv_data)
	if ok {
		// Access parsed data
		fmt.printfln("Parsed %d rows", len(parser.all_rows))
		for row in parser.all_rows {
			for field in row {
				fmt.printf("%s ", field)
			}
			fmt.printf("\n")
		}
	}
}
```

### Bun API Examples

#### Basic Parsing

```typescript
import { parseCSV } from 'ocsv';

// Parse CSV with headers
const result = parseCSV('name,age,city\nAlice,30,NYC\nBob,25,SF', {
  hasHeader: true,
});

console.log(result.headers);  // ['name', 'age', 'city']
console.log(result.rows);     // [['Alice', '30', 'NYC'], ['Bob', '25', 'SF']]
console.log(result.rowCount); // 2
```

#### Parse from File

```typescript
import { parseCSVFile } from 'ocsv';

// Parse CSV file with headers
const data = await parseCSVFile('./sales.csv', {
  hasHeader: true,
  delimiter: ',',
});

console.log(`Parsed ${data.rowCount} rows`);
console.log(`Columns: ${data.headers.join(', ')}`);

// Process rows
for (const row of data.rows) {
  console.log(row);
}
```

#### Custom Configuration

```typescript
import { parseCSV } from 'ocsv';

// Parse TSV (tab-separated)
const tsvData = parseCSV('col1\tcol2\nrow1\tdata', {
  delimiter: '\t',
  hasHeader: true,
});

// Parse with semicolon delimiter (European CSV)
const europeanData = parseCSV('name;age;city\nJohn;30;Paris', {
  delimiter: ';',
  hasHeader: true,
});

// Relaxed mode (allows some RFC violations)
const relaxedData = parseCSV('messy,csv,"data', {
  relaxed: true,
});
```

#### Manual Parser Management

For more control, use the `Parser` class directly:

```typescript
import { Parser } from 'ocsv';

const parser = new Parser();
try {
  const result = parser.parse('a,b,c\n1,2,3');
  console.log(result.rows);
} finally {
  parser.destroy(); // Important: free memory
}
```
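If you create parsers in several places, the try/finally dance is easy to forget. A minimal convenience wrapper (hypothetical, not part of OCSV's API) that guarantees cleanup might look like this:

```typescript
import { Parser } from 'ocsv';

// Hypothetical helper (not part of OCSV's API): runs `fn` with a fresh
// Parser and guarantees destroy() even if parsing throws.
function withParser<T>(fn: (parser: Parser) => T): T {
  const parser = new Parser();
  try {
    return fn(parser);
  } finally {
    parser.destroy();
  }
}

const rows = withParser(p => p.parse('a,b,c\n1,2,3').rows);
console.log(rows); // [['a', 'b', 'c'], ['1', '2', '3']]
```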
## Performance Modes

OCSV offers two access modes to optimize for different use cases:

### Mode Comparison

| Feature | Eager Mode (default) | Lazy Mode |
|---------|---------------------|-----------|
| **Performance** | ~8 MB/s throughput | **≥180 MB/s** (22x faster) |
| **Memory Usage** | High (all data in JS) | **Low** (<200 MB for 10M rows) |
| **Parse Time (10M rows)** | ~150s | **<7s** (21x faster) |
| **Access Pattern** | Random access, arrays | Random access, on-demand |
| **Memory Management** | Automatic (GC) | **Manual** (`destroy()` required) |
| **Best For** | Small files, full iteration | Large files, selective access |
| **TypeScript Support** | Full | Full (discriminated unions) |

### Eager Mode (Default)

**Best for:** Small to medium files (<100k rows), full dataset iteration, simple workflows

All rows are materialized into JavaScript arrays immediately. Easy to use, no cleanup required.

```typescript
import { parseCSV } from 'ocsv';

// Default: eager mode
const result = parseCSV(data, { hasHeader: true });

console.log(result.headers);  // ['name', 'age', 'city']
console.log(result.rows);     // [['Alice', '30', 'NYC'], ...]
console.log(result.rowCount); // 2

// Arrays: standard JavaScript operations
result.rows.forEach(row => console.log(row));
result.rows.map(row => row[0]);
result.rows.filter(row => row[1] > '25');
```

**Pros:**

- ✅ Simple API - standard JavaScript arrays
- ✅ No manual cleanup required
- ✅ Familiar array methods (map, filter, slice)
- ✅ Safe, GC-managed memory

**Cons:**

- ❌ Slower for large files (7.5x overhead)
- ❌ High memory usage (all rows in the JS heap)
- ❌ Parse time proportional to data crossing the FFI boundary

### Lazy Mode (High Performance)

**Best for:** Large files (>1M rows), selective access, memory-constrained environments

Rows stay in native Odin memory and are accessed on demand. Achieves near-FFI performance with a minimal memory footprint.

```typescript
import { parseCSV } from 'ocsv';

// Lazy mode: high performance
const result = parseCSV(data, { mode: 'lazy', hasHeader: true });

try {
  console.log(result.headers);  // ['name', 'age', 'city']
  console.log(result.rowCount); // 10000000

  // On-demand row access
  const row = result.getRow(5000000);
  console.log(row.get(0)); // 'Alice'
  console.log(row.get(1)); // '30'

  // Iterate fields
  for (const field of row) {
    console.log(field);
  }

  // Materialize a row to an array (when needed)
  const arr = row.toArray(); // ['Alice', '30', 'NYC']

  // Efficient slicing (generator)
  for (const row of result.slice(1000, 2000)) {
    console.log(row.get(0));
  }

  // Full iteration (if needed)
  for (const row of result) {
    console.log(row.get(0));
  }
} finally {
  // CRITICAL: must clean up native memory
  result.destroy();
}
```

**Pros:**

- ✅ **22x faster** parse time than eager mode
- ✅ **Low memory** footprint (<200 MB for 10M rows)
- ✅ LRU cache (1000 hot rows) for repeated access
- ✅ Generator-based slicing (memory efficient)
- ✅ Random access to any row (O(1) after cache)

**Cons:**

- ❌ **Manual cleanup required** (`destroy()` must be called)
- ❌ Not standard arrays (use `.get(i)` or `.toArray()`)
- ❌ Use-after-destroy throws errors

### When to Use Each Mode

```
Start
 └─ File > 100 MB or > 1M rows?
     ├─ No  → Eager Mode (simple, safe)
     └─ Yes → Need to access all rows?
         ├─ No  → Lazy Mode (fast, low memory)
         └─ Yes → Memory constrained?
             ├─ Yes → Lazy Mode (streaming)
             └─ No  → Try Eager first (measure, switch if slow)
```

**Use Lazy Mode when:**

- File size > 100 MB or > 1M rows
- You need selective row access (not full iteration)
- Memory is constrained (< 1 GB available)
- You're building streaming/ETL pipelines
- You need maximum parsing performance

**Use Eager Mode when:**

- File size < 100 MB or < 1M rows
- You need full dataset iteration
- You prefer a simpler API (standard arrays)
- Memory cleanup must be automatic (GC)
- You're prototyping or writing quick scripts

### Performance Benchmarks

**Test Setup:** 10M rows, 4 columns, 1.2 GB CSV file

```
Mode         Parse Time   Throughput   Memory Usage
────────────────────────────────────────────────────
FFI Direct   6.2s         193 MB/s     50 MB (baseline)
Lazy Mode    6.8s         176 MB/s     <200 MB
Eager Mode   151.7s       7.9 MB/s     ~8 GB
```

**Key Metrics:**

- Lazy mode is **22x faster** than eager mode
- Lazy mode uses **40x less memory** than eager mode
- Lazy mode is **only 9% slower** than raw FFI (acceptable overhead)
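Throughput depends heavily on hardware and data shape, so it's worth measuring against your own files. A rough harness, assuming only the documented `parseCSV` API plus Bun's `Bun.file` and `Bun.nanoseconds()`:

```typescript
import { parseCSV } from 'ocsv';

// Rough benchmark sketch: compare eager vs lazy on your own file.
// One-shot timings like this are noisy; repeat runs for real numbers.
const data = await Bun.file('./data.csv').text();
const mb = data.length / (1024 * 1024); // approximate (code units, not bytes)

let t = Bun.nanoseconds();
const eager = parseCSV(data, { hasHeader: true });
const eagerSec = (Bun.nanoseconds() - t) / 1e9;
console.log(`eager: ${eagerSec.toFixed(2)}s, ${(mb / eagerSec).toFixed(1)} MB/s, rows=${eager.rowCount}`);

t = Bun.nanoseconds();
const lazy = parseCSV(data, { mode: 'lazy', hasHeader: true });
const lazySec = (Bun.nanoseconds() - t) / 1e9;
try {
  console.log(`lazy:  ${lazySec.toFixed(2)}s, ${(mb / lazySec).toFixed(1)} MB/s, rows=${lazy.rowCount}`);
} finally {
  lazy.destroy(); // lazy mode always needs manual cleanup
}
```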
### Advanced: High-Performance FFI Mode

For advanced users who need maximum FFI throughput, OCSV offers an optimized packed buffer mode that achieves **61.25 MB/s** (56% of native Odin performance).

**Performance Comparison (100K rows, 13.80 MB file):**

```
Mode             Throughput    ns/row   vs Native
──────────────────────────────────────────────────
Native Odin      109.28 MB/s   915      100%
Packed Buffer    61.25 MB/s    2,253    56%
Bulk JSON        40.68 MB/s    2,878    37%
Field-by-Field   29.58 MB/s    3,957    27%
```

**Optimizations:**

- ⚡ **61.25 MB/s** average throughput
- 🚀 **Batched TextDecoder** with reduced decoder overhead
- 💾 **Pre-allocated arrays** to reduce GC pressure
- 📊 **SIMD-friendly memory access** patterns
- 🔄 **Adaptive processing** for different row sizes
- 📦 **Binary packed format** with length-prefixed strings
- ✨ **Single FFI call** instead of multiple round-trips

**Usage:**

```typescript
import { parseCSVPacked } from 'ocsv/bindings/simple';

// Optimized packed buffer mode (highest FFI performance)
const rows = parseCSVPacked(csvData);
// Returns string[][] with minimal FFI overhead
```

**When to use Packed Buffer:**

- Need maximum FFI throughput (>40 MB/s)
- Willing to trade API simplicity for performance
- Working with medium-large files through Bun FFI
- Want to minimize cross-language boundary overhead

**Note:** The 44% overhead compared to native Odin is inherent to the FFI serialization boundary. This is the practical limit for JavaScript-based FFI approaches.

### Memory Management

#### Eager Mode

```typescript
// Automatic cleanup via the garbage collector
const result = parseCSV(data);
// ... use result.rows ...
// Memory is freed automatically when result goes out of scope
```

#### Lazy Mode

```typescript
// Manual cleanup required
const result = parseCSV(data, { mode: 'lazy' });
try {
  // ... use result ...
} finally {
  // CRITICAL: always call destroy()
  result.destroy();
}
```

**Common Pitfalls:**

❌ **Forgetting to destroy:**

```typescript
const result = parseCSV(data, { mode: 'lazy' });
console.log(result.getRow(0));
// Memory leak! Parser never cleaned up
```
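On TypeScript 5.2+ with a recent Bun, you can also adapt a lazy result to explicit resource management (`using`) so `destroy()` is scope-bound. This adapter is a sketch built on the documented `destroy()` method, not an OCSV feature:

```typescript
import { parseCSV } from 'ocsv';

// Sketch (not an OCSV feature): adapt a lazy result to TS 5.2+
// explicit resource management, so `using` calls destroy() when
// the enclosing block exits and cleanup can't be forgotten.
function disposableLazy(data: string, opts: { hasHeader?: boolean } = {}) {
  const result = parseCSV(data, { ...opts, mode: 'lazy' as const });
  return Object.assign(result, {
    [Symbol.dispose]() {
      result.destroy();
    },
  });
}

{
  using result = disposableLazy('name,age\nAlice,30', { hasHeader: true });
  console.log(result.getRow(0).toArray()); // ['Alice', '30']
} // destroy() runs automatically here, even if the block threw
```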
❌ **Use after destroy:**

```typescript
const result = parseCSV(data, { mode: 'lazy' });
result.destroy();
result.getRow(0); // Error: LazyResult has been destroyed
```

✅ **Correct pattern:**

```typescript
const result = parseCSV(data, { mode: 'lazy' });
try {
  const row = result.getRow(0);
  console.log(row.toArray());
} finally {
  result.destroy();
}
```

### TypeScript Support

OCSV provides discriminated union types for type-safe mode selection:

```typescript
import { parseCSV } from 'ocsv';

// Type: ParseResult (array-based)
const eager = parseCSV(data);
console.log(eager.rows[0]); // Type: string[]

// Type: LazyResult (on-demand)
const lazy = parseCSV(data, { mode: 'lazy' });
console.log(lazy.getRow(0)); // Type: LazyRow

// Compiler error: mode mismatch
const wrong = parseCSV(data, { mode: 'lazy' });
console.log(wrong.rows); // Error: Property 'rows' does not exist
```

## Configuration

```odin
// Create parser with custom configuration
parser := ocsv.parser_create()
defer ocsv.parser_destroy(parser)

// TSV (Tab-Separated Values)
parser.config.delimiter = '\t'

// European CSV (semicolon)
parser.config.delimiter = ';'

// Comments (skip lines starting with #)
parser.config.comment = '#'

// Relaxed mode (handle malformed CSV)
parser.config.relaxed = true

// Custom quote character
parser.config.quote = '\''
```

## RFC 4180 Compliance

OCSV fully implements RFC 4180 with support for:

- ✅ Quoted fields with embedded delimiters (`"field, with, commas"`)
- ✅ Nested quotes (`"field with ""quotes"""` → `field with "quotes"`)
- ✅ Multiline fields (newlines inside quotes)
- ✅ CRLF and LF line endings (Windows/Unix)
- ✅ Empty fields (consecutive delimiters: `a,,c`)
- ✅ Trailing delimiters (`a,b,` → 3 fields, last is empty)
- ✅ Leading delimiters (`,a,b` → 3 fields, first is empty)
- ✅ Comments (extension: lines starting with `#`)
- ✅ Unicode/UTF-8 (CJK characters, emojis, etc.)
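These guarantees are observable through the Bun API as well. A short example, assuming the documented eager `parseCSV`:

```typescript
import { parseCSV } from 'ocsv';

// RFC 4180 edge cases through the Bun API: doubled quotes,
// embedded commas, and a newline inside a quoted field.
const csv = [
  'product,description',
  '"Widget ""A""","A great widget, now with more features!"',
  '"Gadget B","Essential gadget\nMulti-line description"',
].join('\n');

const result = parseCSV(csv, { hasHeader: true });
console.log(result.rows[0][0]); // 'Widget "A"'  (doubled quotes collapse)
console.log(result.rows[0][1]); // 'A great widget, now with more features!'
console.log(result.rows[1][1].includes('\n')); // true (multiline field)
console.log(result.rowCount);   // 2
```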
**Example:**

```csv
# Sales data for Q1 2024
product,price,description,quantity
"Widget A",19.99,"A great widget, now with more features!",100
"Gadget B",29.99,"Essential gadget
Multi-line description",50
```

## Testing

**~201 tests, 100% pass rate, 0 memory leaks**

```bash
# Run all tests (standard)
odin test tests

# Run with memory tracking
odin test tests -debug
```

### Test Suites

The project includes comprehensive test coverage across multiple suites:

- Basic functionality and core parsing operations
- RFC 4180 edge cases and compliance
- Integration tests for end-to-end workflows
- Schema validation and type checking
- Transform system and pipelines
- Plugin system functionality
- Streaming API with chunk boundaries
- Large file handling
- Performance regression monitoring
- Error handling and recovery strategies
- Property-based fuzzing tests
- Parallel processing capabilities
- SIMD optimization verification

## Project Structure

```
ocsv/
├── src/
│   ├── ocsv.odin          # Main module
│   ├── parser.odin        # RFC 4180 state machine parser
│   ├── parser_simd.odin   # SIMD-optimized parser
│   ├── parser_error.odin  # Error-aware parser
│   ├── streaming.odin     # Streaming API
│   ├── parallel.odin      # Parallel processing
│   ├── transform.odin     # Transform system
│   ├── plugin.odin        # Plugin architecture
│   ├── simd.odin          # SIMD search functions
│   ├── error.odin         # Error handling system
│   ├── schema.odin        # Schema validation & type system
│   ├── config.odin        # Configuration types
│   └── ffi_bindings.odin  # Bun FFI exports
├── tests/                 # Comprehensive test suite
├── plugins/               # Example plugins
├── bindings/              # Bun/TypeScript bindings
├── benchmarks/            # Performance benchmarks
├── examples/              # Usage examples
└── README.md              # This file
```

## Requirements

- **Odin:** Latest version (tested with Odin dev-2025-01)
- **Bun:** v1.0+ (for FFI integration, optional)
- **Platform:** macOS ARM64 (cross-platform support in development)
- **Task:** v3+ (optional, for automated builds)

## Release Process

This project uses automated releases via [semantic-release](https://github.com/semantic-release/semantic-release). Releases are triggered automatically when changes are pushed to the `main` branch.

### Commit Message Format

All commits must follow [Conventional Commits](https://www.conventionalcommits.org/):

```
<type>(<scope>): <subject>

<body>

<footer>
```

**Examples:**

```bash
git commit -m "feat: add streaming parser API"
git commit -m "fix: handle empty fields correctly"
git commit -m "docs: update installation instructions"
git commit -m "feat!: remove deprecated parseFile method

BREAKING CHANGE: parseFile has been removed, use parseCSVFile instead"
```

**Commit Types:**

- `feat:` New feature (triggers minor version bump)
- `fix:` Bug fix (triggers patch version bump)
- `perf:` Performance improvement (triggers patch version bump)
- `docs:` Documentation changes (no release)
- `chore:` Maintenance tasks (no release)
- `refactor:` Code refactoring (no release)
- `test:` Test changes (no release)
- `ci:` CI/CD changes (no release)

### Version Bumps

- **Patch (1.1.0 → 1.1.1):** `fix:`, `perf:`
- **Minor (1.1.0 → 1.2.0):** `feat:`
- **Major (1.1.0 → 2.0.0):** Any commit with `BREAKING CHANGE:` in the footer or `!` after the type
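As a mental model (a sketch, not semantic-release's actual implementation), the bump inference looks roughly like this:

```typescript
// Sketch of the version-bump rules above — a mental model only,
// not semantic-release's actual implementation.
type Bump = 'major' | 'minor' | 'patch' | null;

function inferBump(commit: string): Bump {
  const [header, ...body] = commit.split('\n');
  if (/^\w+(\(.+\))?!:/.test(header) || body.some(l => l.startsWith('BREAKING CHANGE:'))) {
    return 'major';
  }
  if (/^feat(\(.+\))?:/.test(header)) return 'minor';
  if (/^(fix|perf)(\(.+\))?:/.test(header)) return 'patch';
  return null; // docs:, chore:, refactor:, test:, ci: — no release
}

console.log(inferBump('feat: add streaming parser API'));            // 'minor'
console.log(inferBump('fix: handle empty fields correctly'));        // 'patch'
console.log(inferBump('feat!: remove deprecated parseFile method')); // 'major'
console.log(inferBump('docs: update installation instructions'));    // null
```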
### Release Workflow

1. Developer pushes commits to the `main` branch
2. CI runs tests and builds
3. semantic-release analyzes commits
4. If releasable changes are found, it:
   - Determines the new version number
   - Updates CHANGELOG.md
   - Updates package.json
   - Creates a git tag
   - Publishes to npm with provenance
   - Creates a GitHub release with prebuilt binaries

**Manual Release (Emergency Only):**

```bash
npm run release:dry   # Test what would be released
git push origin main  # Trigger the automated release
```

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines on commit messages and the pull request process.

**Development Workflow:**

1. Fork the repository
2. Create a feature branch
3. Make changes with tests (`odin test tests`)
4. Ensure zero memory leaks
5. Submit a pull request

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- **Odin Language:** [https://odin-lang.org/](https://odin-lang.org/)
- **Bun Runtime:** [https://bun.sh/](https://bun.sh/)
- **RFC 4180:** [https://www.rfc-editor.org/rfc/rfc4180](https://www.rfc-editor.org/rfc/rfc4180)

## Related Projects

- **d3-dsv** - Pure JavaScript CSV/DSV parser
- **papaparse** - Popular JavaScript CSV parser
- **xsv** - Rust CLI tool for CSV processing
- **csv-parser** - Node.js streaming CSV parser

## Contact

- **Issues:** [GitHub Issues](https://github.com/dvrd/ocsv/issues)
- **Discussions:** [GitHub Discussions](https://github.com/dvrd/ocsv/discussions)

---

**Built with ❤️ using Odin + Bun**

**Version:** 1.3.0
**Last Updated:** 2025-11-09