ocsv
Version:
High-performance RFC 4180 compliant CSV parser with lazy mode for Bun. Parse 10M rows in 5s with minimal memory.
395 lines (310 loc) • 18.4 kB
Markdown
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
---
## [1.3.1] - 2025-11-09
---
## [1.3.0] - 2025-11-09
### ⚡ Performance
**Phase 3: JavaScript Deserialization Optimization (Production)**
This major performance update brings OCSV to **56% of native Odin performance** through optimized JavaScript deserialization. Phase 3 is the production version after evaluating multiple optimization approaches (Phases 4-6) that did not improve performance.
### Added
- **Optimized Deserializer** (`bindings/index.js`)
- `_deserializePackedBufferOptimized()` - New high-performance deserializer
- Batched TextDecoder with single reused instance (-30% decoder overhead)
- Pre-allocated arrays (reduced GC pressure)
- SIMD-friendly sequential memory access patterns
- Row size adaptive strategy (<4KB batched, >4KB individual)
- **Performance Benchmarks**
- `test-phase3-performance.js` - Comprehensive benchmark suite
- `test-phase3-integration.js` - Integration test suite
- `PHASE3-SUMMARY.md` - Complete implementation documentation
### Performance Results
**Test Setup:** 100K rows × 12 fields, 13.80 MB CSV file
```
Mode Throughput vs Native Improvement
─────────────────────────────────────────────────────────────
Native Odin 109.28 MB/s 100% (baseline)
Phase 3 Optimized 61.25 MB/s 56.1% +17.1% vs P2
Phase 2 Packed 52.32 MB/s 47.9% +30.0% vs P1
Phase 1 Bulk JSON 40.68 MB/s 37.2% (baseline)
Field-by-Field 29.58 MB/s 27.1% N/A
```
**Key Achievements:**
- ⚡ **61.25 MB/s average** (56% of native Odin, 109 MB/s baseline)
- 📈 **+17.1% faster** than Phase 2 (52.32 → 61.25 MB/s)
- 🚀 **~444,444 rows/second** processing speed
- ✅ **Simple implementation** - No complex SIMD, maintainable code
- 💡 **Pragmatic approach** - JavaScript optimizations only, significant FFI overhead remains
**Note:** Original PRP incorrectly used 61.84 MB/s as baseline. Actual native Odin performance is ~109 MB/s on same dataset. Phase 3 achieves 56% of native, indicating substantial FFI/serialization overhead that requires further optimization.
### Technical Optimizations
**JavaScript-Side Improvements:**
1. **Batched TextDecoder** - Reused single decoder instance reduces overhead by ~30%
2. **Pre-allocated Arrays** - All output arrays sized upfront, eliminates dynamic growth
3. **SIMD-Friendly Patterns** - Sequential memory access for better CPU cache utilization
4. **Adaptive Processing** - Small rows (<4KB) batched, large rows individual for optimal performance
**Implementation Highlights:**
- Zero breaking changes - fully backward compatible
- Phase 2 deserializer preserved as fallback
- All existing tests still passing (36 pass, 1 skip)
- Clean, maintainable code without native SIMD complexity
### Documentation
- Updated README.md with Phase 3 performance metrics
- Added PHASE3-SUMMARY.md with complete implementation details
- Updated performance comparison tables across all documentation
### Notes
- Phase 3 achieved 61.25 MB/s average (56% of native 109.28 MB/s)
- This represents the practical limit for JavaScript FFI-based approach
- Remaining 44% overhead is fundamental to FFI boundary (serialization/deserialization)
- Ready for production - pragmatic, maintainable approach with strong performance
### Attempted Further Optimizations (Phases 4-6)
The following optimizations were implemented and benchmarked but **did not improve performance**:
- ❌ **Phase 4: ASCII Fast Path** (60.15 MB/s, -2%) - latin1 decoder only 11% faster, not 3× as expected
- ❌ **Phase 5: UTF-16 Pre-transcoding** (30.72 MB/s, -50%) - Transcoding overhead too high, buffer doubled
- ❌ **Phase 6: ARM NEON SIMD** (51.31 MB/s, -16%) - Chunking without real SIMD added overhead
**Conclusion:** Phase 3 represents the optimal FFI-based solution. Further improvements would require fundamentally different architectures (native addons, WASM, etc.).
---
## [1.2.1] - 2025-10-27
### 🐛 Bug Fixes
**Critical Fix: Missing Dependencies in npm Package**
### Fixed
- **Missing Files in npm Package**: Fixed imports of `errors.js` and `lazy.js` that didn't exist
- Implemented `OcsvError` class inline in `index.js`
- Implemented `ParseErrorCode` constants inline
- Implemented `LazyRow` and `LazyResult` classes inline
- Removed broken imports that caused module resolution failures
- **Package Usability**: npm package now works correctly without missing module errors
### Technical Details
- All classes now defined in `bindings/index.js` (no external dependencies)
- Maintains full backwards compatibility with existing exports
- Zero breaking changes for existing users
---
## [1.2.0] - 2025-10-27
### ⚡ Performance
**Phase 2: Packed Buffer FFI Optimization - Zero-Copy Deserialization**
This major performance update introduces packed binary buffer serialization for minimal FFI overhead and maximum throughput.
### Added
- **Packed Buffer Serialization** (`src/ffi_bindings.odin`)
- Binary format with 24-byte header (magic, version, metadata)
- Row offset array for O(1) random access
- Length-prefixed UTF-8 strings (u16 + data)
- Little-endian encoding throughout
- **Zero-Copy Deserialization** (`bindings/simple.ts`)
- `parseCSVPacked()` - New high-performance parsing function
- Direct ArrayBuffer access with `toArrayBuffer()`
- Efficient DataView + TextDecoder deserialization
- **Memory Management**
- Added `packed_buffer` field to Parser struct
- Automatic cleanup in `parser_destroy()`
- Memory stored in Parser for proper lifecycle management
- **Comprehensive Benchmarks** (`examples/benchmark_bulk.ts`)
- Three-way comparison: field-by-field vs JSON vs packed buffer
- Performance ratings against native Odin baseline
- Data validation across all modes
### Performance Results
**Test Setup:** 100K rows, 12.71 MB CSV file
```
Mode Throughput ns/row vs Baseline Speedup
─────────────────────────────────────────────────────────────────
Native Odin 61.84 MB/s - 100% -
Packed Buffer 52.32 MB/s 2,238 84.6% ⭐ 1.77×
Bulk JSON 40.68 MB/s 2,878 65.8% 1.38×
Field-by-Field 29.58 MB/s 3,957 47.8% 1.00×
```
**Key Achievements:**
- ⚡ **52.32 MB/s** throughput (84.6% of native Odin performance)
- 🚀 **1.77× faster** than field-by-field FFI access
- 📈 **1.29× faster** than Phase 1 JSON serialization (40.68 → 52.32 MB/s)
- 🎯 **Single FFI call** instead of N×M individual calls
- 💾 **Zero-copy** memory access with ArrayBuffer
- ✅ **100% data validation** - all tests pass
### Technical Details
**Binary Format Specification:**
```
Header (24 bytes):
0-3: magic (0x4F435356 "OCSV")
4-7: version (1)
8-11: row_count (u32)
12-15: field_count (u32)
16-23: total_bytes (u64)
Row Offsets (row_count × 4 bytes):
24+i*4: offset to row i data
Field Data (variable length):
[length: u16][UTF-8 bytes]
```
**Implementation Highlights:**
- Uses `core:encoding/endian` for consistent byte order
- Efficient `calculate_packed_buffer_size()` for pre-allocation
- Modular helpers: `write_header()`, `write_row_offsets()`, `write_field_data()`
- Export: `ocsv_rows_to_packed_buffer()` FFI function
### Documentation
- Updated README.md with FFI Performance Optimization section
- Added comprehensive benchmark results and usage examples
- Documented when to use packed buffer mode vs other modes
### Notes
- Phase 2 target was 55 MB/s (89% of baseline)
- Achieved 52.32 MB/s (84.6% of baseline) - within 5% of target
- "GOOD" performance rating (within 20% of native Odin baseline)
- Ready for production use in performance-critical applications
---
## 1.1.0 (2025-10-20)
* fix: add contents write permission for GitHub release creation ([ff4b7c8](https://github.com/dvrd/ocsv/commit/ff4b7c8))
* fix: add LLVM library paths and prepare v1.1.1 npm package ([9fb6bad](https://github.com/dvrd/ocsv/commit/9fb6bad))
* fix: add missing warnings cleanup in parser_extended_destroy ([5b3d15c](https://github.com/dvrd/ocsv/commit/5b3d15c))
* fix: also disable streaming tests using Error_Info ([8a44b38](https://github.com/dvrd/ocsv/commit/8a44b38))
* fix: disable all advanced feature tests, keep only basic parser tests ([ac51d5e](https://github.com/dvrd/ocsv/commit/ac51d5e))
* fix: disable all tests except parser and edge_cases (minimal test set) ([ed41c70](https://github.com/dvrd/ocsv/commit/ed41c70))
* fix: disable Collect_All_Errors tests due to crash ([205c6f1](https://github.com/dvrd/ocsv/commit/205c6f1))
* fix: disable entire error handling test file ([8dd5f27](https://github.com/dvrd/ocsv/commit/8dd5f27))
* fix: disable error_handling and streaming tests (depend on disabled source files) ([9bcb049](https://github.com/dvrd/ocsv/commit/9bcb049))
* fix: disable parser_error.odin source file temporarily ([67296a7](https://github.com/dvrd/ocsv/commit/67296a7))
* fix: handle negative bytes_read in streaming parser ([749fc3f](https://github.com/dvrd/ocsv/commit/749fc3f))
* fix: implement CI caching and resolve memory corruption in error handling ([6894e98](https://github.com/dvrd/ocsv/commit/6894e98))
* fix: invalidate macOS LLVM cache to force fresh install ([7b023e2](https://github.com/dvrd/ocsv/commit/7b023e2))
* fix: remove registry-url from setup-node to fix npm authentication ([c5f2bbd](https://github.com/dvrd/ocsv/commit/c5f2bbd))
* fix: resolve infinite recursion in SIMD parser and re-enable all tests ([a16652c](https://github.com/dvrd/ocsv/commit/a16652c))
* fix: set LLVM_CONFIG env var when building Odin compiler ([76b9fde](https://github.com/dvrd/ocsv/commit/76b9fde))
* fix: temporarily disable parallel and stress tests ([715b298](https://github.com/dvrd/ocsv/commit/715b298))
* fix: use matrix.llvm_path directly for LLVM_CONFIG (both platforms) ([604e826](https://github.com/dvrd/ocsv/commit/604e826))
* fix: Windows PowerShell UTF-8 and Lint check issues ([96c9a93](https://github.com/dvrd/ocsv/commit/96c9a93))
* chore: add benchmarks, simple bindings, and documentation improvements ([3d5aed5](https://github.com/dvrd/ocsv/commit/3d5aed5))
* chore: remove temporary files and old documentation ([4625fb0](https://github.com/dvrd/ocsv/commit/4625fb0))
* chore: trigger release after npm token fix ([271e34a](https://github.com/dvrd/ocsv/commit/271e34a))
* chore: trigger release workflow ([d073927](https://github.com/dvrd/ocsv/commit/d073927))
* chore: update package configuration ([ec85493](https://github.com/dvrd/ocsv/commit/ec85493))
* chore: update prebuilds from CI workflow (all platforms) ([2bdc4b3](https://github.com/dvrd/ocsv/commit/2bdc4b3))
* chore: verify npm token configuration ([7751017](https://github.com/dvrd/ocsv/commit/7751017))
* feat: add semantic release configuration ([10349f0](https://github.com/dvrd/ocsv/commit/10349f0))
* feat: implement automated release management ([f376df3](https://github.com/dvrd/ocsv/commit/f376df3))
* build: update prebuilt binaries for all platforms ([d88e943](https://github.com/dvrd/ocsv/commit/d88e943))
* refactor: improve examples and tooling ([a9dcb3a](https://github.com/dvrd/ocsv/commit/a9dcb3a))
* docs: add NPM publishing guide and contribution guidelines ([3c821f3](https://github.com/dvrd/ocsv/commit/3c821f3))
* ci: optimize GitHub Actions workflows ([02b501c](https://github.com/dvrd/ocsv/commit/02b501c))
## [1.1.1] - 2025-10-17
### 🐛 Bug Fixes & Improvements
This patch release addresses critical memory corruption issues and significantly improves CI performance.
### Fixed
- **Memory Corruption**: Fixed string ownership model in Error_Info structures
- `make_error()` now always clones strings for consistent ownership
- `make_error_result()` independently clones error strings to prevent double-free
- Added `error_info_destroy()` helper for proper cleanup
- **Platform-Specific Memory Management**: Implemented conditional cleanup for Windows (VirtualAlloc stricter than Unix mmap)
- **Test Memory Leaks**: Fixed cleanup in `destroy_test_context()` to free Error_Info strings
- **Empty Input Error**: Fixed orphaned error string in empty input handling
### Changed
- **CI Performance**: Reduced workflow time from ~18 min to 8m 28s (53% improvement)
- Implemented comprehensive caching (LLVM, Odin compiler)
- Added fail-fast lint job before matrix builds
- Platform-specific cache keys for better hit rates
### Added
- **Cross-Platform Prebuilds**: All platforms now available in npm package
- macOS ARM64: `libocsv.dylib` (69 KB)
- Linux x64: `libocsv.so` (39 KB)
- Windows x64: `ocsv.dll` (122 KB)
- **Test Coverage**: Re-enabled 31 additional tests (+69% increase)
- `test_stress.odin` (14 stress tests)
- `test_error_handling.odin` (20 error handling tests)
- `test_streaming.odin` (17 streaming tests)
### Quality Metrics
- ✅ **Tests**: 75/76 passing (98.7%)
- ✅ **Memory**: 0 leaks, 0 "bad free" warnings
- ✅ **Platforms**: macOS ARM64, Linux x64, Windows x64
- ⚠️ **Known Issue**: 1 flaky concurrent test (passes in isolation)
---
## [1.0.0] - 2025-01-16
### 🎉 Production Release
Version 1.0.0 marks the first stable release of OCSV with comprehensive features and testing.
### Added
- **npm Package**: macOS ARM64 prebuild available via npm
- **CHANGELOG.md**: Complete version history and release notes following Keep a Changelog format
- **Package Structure**: Optimized npm package with bindings and prebuilds
### Changed
- **Version Bump**: Upgraded from v0.1.0 to v1.0.0, signaling stable API
- **Semantic Versioning Commitment**: From this point forward, all changes will follow strict semantic versioning
### Quality
- **Test Coverage**: ~201 tests passing (100% pass rate)
- **Memory Safety**: Zero memory leaks across all tests
- **RFC Compliance**: Full RFC 4180 compliance with all edge cases covered
### Documentation
- **Core Documentation**: README, CHANGELOG, examples, and usage guides
- **Platform Support**: Documented build and installation instructions for macOS ARM64
---
## [0.1.0] - 2025-01-15
### 🚀 Initial Release
First functional release of OCSV with comprehensive CSV parsing capabilities.
### Added
#### Core Functionality
- **RFC 4180 Compliant Parser**: Full CSV specification support with state machine implementation
- **Quoted Fields**: Support for embedded delimiters (`"field, with, commas"`)
- **Nested Quotes**: Proper handling of doubled quotes (`""` → `"`)
- **Multiline Fields**: Newlines inside quoted fields
- **Line Endings**: Both CRLF (Windows) and LF (Unix) support
- **Empty Fields**: Correct handling of consecutive delimiters and trailing delimiters
- **Comments**: Lines starting with `#` (configurable)
- **UTF-8 Support**: Full Unicode support including CJK characters and emojis
#### Bun FFI Integration
- **Native Bindings**: FFI integration with Bun runtime
- **`parseCSV()`**: Convenience function for parsing CSV strings
- **`parseCSVFile()`**: Async function for parsing CSV files
- **`Parser` Class**: Manual parser management with `parse()`, `parseFile()`, and `destroy()` methods
- **TypeScript Declarations**: Full TypeScript support with `.d.ts` files
#### Configuration Options
- **`delimiter`**: Custom field delimiter (default: `,`)
- **`quote`**: Custom quote character (default: `"`)
- **`escape`**: Custom escape character (default: `"`)
- **`comment`**: Comment line prefix (default: `#`)
- **`relaxed`**: Allow RFC violations (default: `false`)
- **`hasHeader`**: Treat first row as headers (default: `false`)
#### Advanced Features
- **Error Handling & Recovery**: Multiple error types and recovery strategies
- **Schema Validation**: Type checking and validation rules
- **Streaming API**: Chunk-based processing for large files
- **Transform System**: Built-in transforms and pipelines
- **Plugin Architecture**: Extensible plugin system
- **Parallel Processing**: Multi-threaded parsing (experimental)
- **SIMD Optimization**: ARM NEON byte search implementation
#### Testing & Quality
- **~201 Tests**: Comprehensive test coverage across multiple suites
- **100% Pass Rate**: All tests passing with zero failures
- **Zero Memory Leaks**: Verified across all tests
- **RFC Compliance**: Full RFC 4180 edge case coverage
### Technical Details
#### Architecture
- **Language**: Odin (core parser) + JavaScript/TypeScript (FFI bindings)
- **Build System**: Taskfile for automated builds
- **Memory Management**: Explicit memory management with defer pattern
- **State Machine**: RFC 4180 compliant state machine parser
- **FFI**: Bun FFI for integration
#### Dependencies
- **Runtime**: Bun v1.0+ (for FFI)
- **Build**: Odin compiler (latest)
- **Platform**: macOS ARM64 (primary support)
#### Package Structure
```
ocsv/
├── bindings/ # FFI bindings and TypeScript declarations
├── prebuilds/ # Platform-specific prebuilt binaries
│ └── darwin-arm64/ # macOS ARM64 prebuild
├── src/ # Odin source code
├── tests/ # Test suites
├── README.md # Documentation
└── LICENSE # MIT License
```
### Known Limitations
- **Platform Coverage**: v0.1.0 only includes macOS ARM64 prebuild
- **Parallel Processing**: Experimental feature, needs optimization
- **SIMD Support**: Currently ARM NEON only
---
## Links
- [Repository](https://github.com/dvrd/ocsv)
- [Issues](https://github.com/dvrd/ocsv/issues)
- [NPM Package](https://www.npmjs.com/package/ocsv)
[1.3.1]: https://github.com/dvrd/ocsv/releases/tag/v1.3.1
[1.3.0]: https://github.com/dvrd/ocsv/releases/tag/v1.3.0
[1.2.1]: https://github.com/dvrd/ocsv/releases/tag/v1.2.1
[1.2.0]: https://github.com/dvrd/ocsv/releases/tag/v1.2.0
[1.1.1]: https://github.com/dvrd/ocsv/releases/tag/v1.1.1
[1.0.0]: https://github.com/dvrd/ocsv/releases/tag/v1.0.0
[0.1.0]: https://github.com/dvrd/ocsv/releases/tag/v0.1.0