UNPKG

ocsv

Version:

High-performance RFC 4180 compliant CSV parser with lazy mode for Bun. Parse 10M rows in 5s with minimal memory.

395 lines (310 loc) 18.4 kB
# Changelog All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [Unreleased] --- ## [1.3.1] - 2025-11-09 --- ## [1.3.0] - 2025-11-09 ### ⚡ Performance **Phase 3: JavaScript Deserialization Optimization (Production)** This major performance update brings OCSV to **56% of native Odin performance** through optimized JavaScript deserialization. Phase 3 is the production version after evaluating multiple optimization approaches (Phases 4-6) that did not improve performance. ### Added - **Optimized Deserializer** (`bindings/index.js`) - `_deserializePackedBufferOptimized()` - New high-performance deserializer - Batched TextDecoder with single reused instance (-30% decoder overhead) - Pre-allocated arrays (reduced GC pressure) - SIMD-friendly sequential memory access patterns - Row size adaptive strategy (<4KB batched, >4KB individual) - **Performance Benchmarks** - `test-phase3-performance.js` - Comprehensive benchmark suite - `test-phase3-integration.js` - Integration test suite - `PHASE3-SUMMARY.md` - Complete implementation documentation ### Performance Results **Test Setup:** 100K rows × 12 fields, 13.80 MB CSV file ``` Mode Throughput vs Native Improvement ───────────────────────────────────────────────────────────── Native Odin 109.28 MB/s 100% (baseline) Phase 3 Optimized 61.25 MB/s 56.1% +17.1% vs P2 Phase 2 Packed 52.32 MB/s 47.9% +30.0% vs P1 Phase 1 Bulk JSON 40.68 MB/s 37.2% (baseline) Field-by-Field 29.58 MB/s 27.1% N/A ``` **Key Achievements:** -**61.25 MB/s average** (56% of native Odin, 109 MB/s baseline) - 📈 **+17.1% faster** than Phase 2 (52.32 → 61.25 MB/s) - 🚀 **~444,444 rows/second** processing speed -**Simple implementation** - No complex SIMD, maintainable code - 💡 **Pragmatic approach** - JavaScript optimizations only, significant FFI overhead remains **Note:** Original PRP incorrectly used 61.84 MB/s as baseline. Actual native Odin performance is ~109 MB/s on same dataset. Phase 3 achieves 56% of native, indicating substantial FFI/serialization overhead that requires further optimization. ### Technical Optimizations **JavaScript-Side Improvements:** 1. **Batched TextDecoder** - Reused single decoder instance reduces overhead by ~30% 2. **Pre-allocated Arrays** - All output arrays sized upfront, eliminates dynamic growth 3. **SIMD-Friendly Patterns** - Sequential memory access for better CPU cache utilization 4. **Adaptive Processing** - Small rows (<4KB) batched, large rows individual for optimal performance **Implementation Highlights:** - Zero breaking changes - fully backward compatible - Phase 2 deserializer preserved as fallback - All existing tests still passing (36 pass, 1 skip) - Clean, maintainable code without native SIMD complexity ### Documentation - Updated README.md with Phase 3 performance metrics - Added PHASE3-SUMMARY.md with complete implementation details - Updated performance comparison tables across all documentation ### Notes - Phase 3 achieved 61.25 MB/s average (56% of native 109.28 MB/s) - This represents the practical limit for JavaScript FFI-based approach - Remaining 44% overhead is fundamental to FFI boundary (serialization/deserialization) - Ready for production - pragmatic, maintainable approach with strong performance ### Attempted Further Optimizations (Phases 4-6) The following optimizations were implemented and benchmarked but **did not improve performance**: -**Phase 4: ASCII Fast Path** (60.15 MB/s, -2%) - latin1 decoder only 11% faster, not 3× as expected -**Phase 5: UTF-16 Pre-transcoding** (30.72 MB/s, -50%) - Transcoding overhead too high, buffer doubled -**Phase 6: ARM NEON SIMD** (51.31 MB/s, -16%) - Chunking without real SIMD added overhead **Conclusion:** Phase 3 represents the optimal FFI-based solution. Further improvements would require fundamentally different architectures (native addons, WASM, etc.). --- ## [1.2.1] - 2025-10-27 ### 🐛 Bug Fixes **Critical Fix: Missing Dependencies in npm Package** ### Fixed - **Missing Files in npm Package**: Fixed imports of `errors.js` and `lazy.js` that didn't exist - Implemented `OcsvError` class inline in `index.js` - Implemented `ParseErrorCode` constants inline - Implemented `LazyRow` and `LazyResult` classes inline - Removed broken imports that caused module resolution failures - **Package Usability**: npm package now works correctly without missing module errors ### Technical Details - All classes now defined in `bindings/index.js` (no external dependencies) - Maintains full backwards compatibility with existing exports - Zero breaking changes for existing users --- ## [1.2.0] - 2025-10-27 ### ⚡ Performance **Phase 2: Packed Buffer FFI Optimization - Zero-Copy Deserialization** This major performance update introduces packed binary buffer serialization for minimal FFI overhead and maximum throughput. ### Added - **Packed Buffer Serialization** (`src/ffi_bindings.odin`) - Binary format with 24-byte header (magic, version, metadata) - Row offset array for O(1) random access - Length-prefixed UTF-8 strings (u16 + data) - Little-endian encoding throughout - **Zero-Copy Deserialization** (`bindings/simple.ts`) - `parseCSVPacked()` - New high-performance parsing function - Direct ArrayBuffer access with `toArrayBuffer()` - Efficient DataView + TextDecoder deserialization - **Memory Management** - Added `packed_buffer` field to Parser struct - Automatic cleanup in `parser_destroy()` - Memory stored in Parser for proper lifecycle management - **Comprehensive Benchmarks** (`examples/benchmark_bulk.ts`) - Three-way comparison: field-by-field vs JSON vs packed buffer - Performance ratings against native Odin baseline - Data validation across all modes ### Performance Results **Test Setup:** 100K rows, 12.71 MB CSV file ``` Mode Throughput ns/row vs Baseline Speedup ───────────────────────────────────────────────────────────────── Native Odin 61.84 MB/s - 100% - Packed Buffer 52.32 MB/s 2,238 84.6% ⭐ 1.77× Bulk JSON 40.68 MB/s 2,878 65.8% 1.38× Field-by-Field 29.58 MB/s 3,957 47.8% 1.00× ``` **Key Achievements:** -**52.32 MB/s** throughput (84.6% of native Odin performance) - 🚀 **1.77× faster** than field-by-field FFI access - 📈 **1.29× faster** than Phase 1 JSON serialization (40.68 → 52.32 MB/s) - 🎯 **Single FFI call** instead of N×M individual calls - 💾 **Zero-copy** memory access with ArrayBuffer -**100% data validation** - all tests pass ### Technical Details **Binary Format Specification:** ``` Header (24 bytes): 0-3: magic (0x4F435356 "OCSV") 4-7: version (1) 8-11: row_count (u32) 12-15: field_count (u32) 16-23: total_bytes (u64) Row Offsets (row_count × 4 bytes): 24+i*4: offset to row i data Field Data (variable length): [length: u16][UTF-8 bytes] ``` **Implementation Highlights:** - Uses `core:encoding/endian` for consistent byte order - Efficient `calculate_packed_buffer_size()` for pre-allocation - Modular helpers: `write_header()`, `write_row_offsets()`, `write_field_data()` - Export: `ocsv_rows_to_packed_buffer()` FFI function ### Documentation - Updated README.md with FFI Performance Optimization section - Added comprehensive benchmark results and usage examples - Documented when to use packed buffer mode vs other modes ### Notes - Phase 2 target was 55 MB/s (89% of baseline) - Achieved 52.32 MB/s (84.6% of baseline) - within 5% of target - "GOOD" performance rating (within 20% of native Odin baseline) - Ready for production use in performance-critical applications --- ## 1.1.0 (2025-10-20) * fix: add contents write permission for GitHub release creation ([ff4b7c8](https://github.com/dvrd/ocsv/commit/ff4b7c8)) * fix: add LLVM library paths and prepare v1.1.1 npm package ([9fb6bad](https://github.com/dvrd/ocsv/commit/9fb6bad)) * fix: add missing warnings cleanup in parser_extended_destroy ([5b3d15c](https://github.com/dvrd/ocsv/commit/5b3d15c)) * fix: also disable streaming tests using Error_Info ([8a44b38](https://github.com/dvrd/ocsv/commit/8a44b38)) * fix: disable all advanced feature tests, keep only basic parser tests ([ac51d5e](https://github.com/dvrd/ocsv/commit/ac51d5e)) * fix: disable all tests except parser and edge_cases (minimal test set) ([ed41c70](https://github.com/dvrd/ocsv/commit/ed41c70)) * fix: disable Collect_All_Errors tests due to crash ([205c6f1](https://github.com/dvrd/ocsv/commit/205c6f1)) * fix: disable entire error handling test file ([8dd5f27](https://github.com/dvrd/ocsv/commit/8dd5f27)) * fix: disable error_handling and streaming tests (depend on disabled source files) ([9bcb049](https://github.com/dvrd/ocsv/commit/9bcb049)) * fix: disable parser_error.odin source file temporarily ([67296a7](https://github.com/dvrd/ocsv/commit/67296a7)) * fix: handle negative bytes_read in streaming parser ([749fc3f](https://github.com/dvrd/ocsv/commit/749fc3f)) * fix: implement CI caching and resolve memory corruption in error handling ([6894e98](https://github.com/dvrd/ocsv/commit/6894e98)) * fix: invalidate macOS LLVM cache to force fresh install ([7b023e2](https://github.com/dvrd/ocsv/commit/7b023e2)) * fix: remove registry-url from setup-node to fix npm authentication ([c5f2bbd](https://github.com/dvrd/ocsv/commit/c5f2bbd)) * fix: resolve infinite recursion in SIMD parser and re-enable all tests ([a16652c](https://github.com/dvrd/ocsv/commit/a16652c)) * fix: set LLVM_CONFIG env var when building Odin compiler ([76b9fde](https://github.com/dvrd/ocsv/commit/76b9fde)) * fix: temporarily disable parallel and stress tests ([715b298](https://github.com/dvrd/ocsv/commit/715b298)) * fix: use matrix.llvm_path directly for LLVM_CONFIG (both platforms) ([604e826](https://github.com/dvrd/ocsv/commit/604e826)) * fix: Windows PowerShell UTF-8 and Lint check issues ([96c9a93](https://github.com/dvrd/ocsv/commit/96c9a93)) * chore: add benchmarks, simple bindings, and documentation improvements ([3d5aed5](https://github.com/dvrd/ocsv/commit/3d5aed5)) * chore: remove temporary files and old documentation ([4625fb0](https://github.com/dvrd/ocsv/commit/4625fb0)) * chore: trigger release after npm token fix ([271e34a](https://github.com/dvrd/ocsv/commit/271e34a)) * chore: trigger release workflow ([d073927](https://github.com/dvrd/ocsv/commit/d073927)) * chore: update package configuration ([ec85493](https://github.com/dvrd/ocsv/commit/ec85493)) * chore: update prebuilds from CI workflow (all platforms) ([2bdc4b3](https://github.com/dvrd/ocsv/commit/2bdc4b3)) * chore: verify npm token configuration ([7751017](https://github.com/dvrd/ocsv/commit/7751017)) * feat: add semantic release configuration ([10349f0](https://github.com/dvrd/ocsv/commit/10349f0)) * feat: implement automated release management ([f376df3](https://github.com/dvrd/ocsv/commit/f376df3)) * build: update prebuilt binaries for all platforms ([d88e943](https://github.com/dvrd/ocsv/commit/d88e943)) * refactor: improve examples and tooling ([a9dcb3a](https://github.com/dvrd/ocsv/commit/a9dcb3a)) * docs: add NPM publishing guide and contribution guidelines ([3c821f3](https://github.com/dvrd/ocsv/commit/3c821f3)) * ci: optimize GitHub Actions workflows ([02b501c](https://github.com/dvrd/ocsv/commit/02b501c)) ## [1.1.1] - 2025-10-17 ### 🐛 Bug Fixes & Improvements This patch release addresses critical memory corruption issues and significantly improves CI performance. ### Fixed - **Memory Corruption**: Fixed string ownership model in Error_Info structures - `make_error()` now always clones strings for consistent ownership - `make_error_result()` independently clones error strings to prevent double-free - Added `error_info_destroy()` helper for proper cleanup - **Platform-Specific Memory Management**: Implemented conditional cleanup for Windows (VirtualAlloc stricter than Unix mmap) - **Test Memory Leaks**: Fixed cleanup in `destroy_test_context()` to free Error_Info strings - **Empty Input Error**: Fixed orphaned error string in empty input handling ### Changed - **CI Performance**: Reduced workflow time from ~18 min to 8m 28s (53% improvement) - Implemented comprehensive caching (LLVM, Odin compiler) - Added fail-fast lint job before matrix builds - Platform-specific cache keys for better hit rates ### Added - **Cross-Platform Prebuilds**: All platforms now available in npm package - macOS ARM64: `libocsv.dylib` (69 KB) - Linux x64: `libocsv.so` (39 KB) - Windows x64: `ocsv.dll` (122 KB) - **Test Coverage**: Re-enabled 31 additional tests (+69% increase) - `test_stress.odin` (14 stress tests) - `test_error_handling.odin` (20 error handling tests) - `test_streaming.odin` (17 streaming tests) ### Quality Metrics -**Tests**: 75/76 passing (98.7%) -**Memory**: 0 leaks, 0 "bad free" warnings -**Platforms**: macOS ARM64, Linux x64, Windows x64 - ⚠️ **Known Issue**: 1 flaky concurrent test (passes in isolation) --- ## [1.0.0] - 2025-01-16 ### 🎉 Production Release Version 1.0.0 marks the first stable release of OCSV with comprehensive features and testing. ### Added - **npm Package**: macOS ARM64 prebuild available via npm - **CHANGELOG.md**: Complete version history and release notes following Keep a Changelog format - **Package Structure**: Optimized npm package with bindings and prebuilds ### Changed - **Version Bump**: Upgraded from v0.1.0 to v1.0.0, signaling stable API - **Semantic Versioning Commitment**: From this point forward, all changes will follow strict semantic versioning ### Quality - **Test Coverage**: ~201 tests passing (100% pass rate) - **Memory Safety**: Zero memory leaks across all tests - **RFC Compliance**: Full RFC 4180 compliance with all edge cases covered ### Documentation - **Core Documentation**: README, CHANGELOG, examples, and usage guides - **Platform Support**: Documented build and installation instructions for macOS ARM64 --- ## [0.1.0] - 2025-01-15 ### 🚀 Initial Release First functional release of OCSV with comprehensive CSV parsing capabilities. ### Added #### Core Functionality - **RFC 4180 Compliant Parser**: Full CSV specification support with state machine implementation - **Quoted Fields**: Support for embedded delimiters (`"field, with, commas"`) - **Nested Quotes**: Proper handling of doubled quotes (`""``"`) - **Multiline Fields**: Newlines inside quoted fields - **Line Endings**: Both CRLF (Windows) and LF (Unix) support - **Empty Fields**: Correct handling of consecutive delimiters and trailing delimiters - **Comments**: Lines starting with `#` (configurable) - **UTF-8 Support**: Full Unicode support including CJK characters and emojis #### Bun FFI Integration - **Native Bindings**: FFI integration with Bun runtime - **`parseCSV()`**: Convenience function for parsing CSV strings - **`parseCSVFile()`**: Async function for parsing CSV files - **`Parser` Class**: Manual parser management with `parse()`, `parseFile()`, and `destroy()` methods - **TypeScript Declarations**: Full TypeScript support with `.d.ts` files #### Configuration Options - **`delimiter`**: Custom field delimiter (default: `,`) - **`quote`**: Custom quote character (default: `"`) - **`escape`**: Custom escape character (default: `"`) - **`comment`**: Comment line prefix (default: `#`) - **`relaxed`**: Allow RFC violations (default: `false`) - **`hasHeader`**: Treat first row as headers (default: `false`) #### Advanced Features - **Error Handling & Recovery**: Multiple error types and recovery strategies - **Schema Validation**: Type checking and validation rules - **Streaming API**: Chunk-based processing for large files - **Transform System**: Built-in transforms and pipelines - **Plugin Architecture**: Extensible plugin system - **Parallel Processing**: Multi-threaded parsing (experimental) - **SIMD Optimization**: ARM NEON byte search implementation #### Testing & Quality - **~201 Tests**: Comprehensive test coverage across multiple suites - **100% Pass Rate**: All tests passing with zero failures - **Zero Memory Leaks**: Verified across all tests - **RFC Compliance**: Full RFC 4180 edge case coverage ### Technical Details #### Architecture - **Language**: Odin (core parser) + JavaScript/TypeScript (FFI bindings) - **Build System**: Taskfile for automated builds - **Memory Management**: Explicit memory management with defer pattern - **State Machine**: RFC 4180 compliant state machine parser - **FFI**: Bun FFI for integration #### Dependencies - **Runtime**: Bun v1.0+ (for FFI) - **Build**: Odin compiler (latest) - **Platform**: macOS ARM64 (primary support) #### Package Structure ``` ocsv/ ├── bindings/ # FFI bindings and TypeScript declarations ├── prebuilds/ # Platform-specific prebuilt binaries │ └── darwin-arm64/ # macOS ARM64 prebuild ├── src/ # Odin source code ├── tests/ # Test suites ├── README.md # Documentation └── LICENSE # MIT License ``` ### Known Limitations - **Platform Coverage**: v0.1.0 only includes macOS ARM64 prebuild - **Parallel Processing**: Experimental feature, needs optimization - **SIMD Support**: Currently ARM NEON only --- ## Links - [Repository](https://github.com/dvrd/ocsv) - [Issues](https://github.com/dvrd/ocsv/issues) - [NPM Package](https://www.npmjs.com/package/ocsv) [1.3.1]: https://github.com/dvrd/ocsv/releases/tag/v1.3.1 [1.3.0]: https://github.com/dvrd/ocsv/releases/tag/v1.3.0 [1.2.1]: https://github.com/dvrd/ocsv/releases/tag/v1.2.1 [1.2.0]: https://github.com/dvrd/ocsv/releases/tag/v1.2.0 [1.1.1]: https://github.com/dvrd/ocsv/releases/tag/v1.1.1 [1.0.0]: https://github.com/dvrd/ocsv/releases/tag/v1.0.0 [0.1.0]: https://github.com/dvrd/ocsv/releases/tag/v0.1.0