# @himorishige/noren-core

[![npm version](https://img.shields.io/npm/v/@himorishige/noren-core.svg)](https://www.npmjs.com/package/@himorishige/noren-core)
[![Bundle Size](https://img.shields.io/bundlephobia/minzip/@himorishige/noren-core.svg)](https://bundlephobia.com/package/@himorishige/noren-core)

**Fast, lightweight PII detection and masking library built on Web Standards**

The core library of the Noren PII protection suite - designed for **simplicity**, **performance**, and **universal compatibility**.

## ✨ Key Features

- 🚀 **Ultra-lightweight**: 124KB bundled size (77% code reduction)
- ⚡ **High performance**: 102K+ ops/sec with pre-compiled patterns
- 🌐 **Web Standards**: Works everywhere (Node.js, Edge, Browsers)
- 🎯 **Smart detection**: Built-in patterns with confidence scoring
- 🛡️ **Advanced validation**: Context-aware false positive filtering with 3 strictness levels
- 📊 **JSON/NDJSON Support**: Native structured data detection with key-based matching
- ⚡ **Prefilter optimization**: Fast screening before expensive regex operations
- 🔒 **Enhanced security**: HMAC-based tokenization with 32-char minimum key
- 📦 **Zero dependencies**: Pure JavaScript, no external deps
- 🎚️ **Confidence scoring**: Rule-based detection accuracy control

## 🚀 Installation

```bash
npm install @himorishige/noren-core
```

## 📖 Quick Start

### Basic Usage

```typescript
import { Registry, redactText } from '@himorishige/noren-core'

// Create registry with default settings
const registry = new Registry({ defaultAction: 'mask' })

// Detect and mask PII
const input = 'Contact: john@company.com, Card: 4242-4242-4242-4242'
const result = await redactText(registry, input)

console.log(result)
// Output: Contact: [REDACTED:email], Card: [REDACTED:credit_card]
```

### With Custom Rules

```typescript
const registry = new Registry({
  defaultAction: 'mask',
  enableConfidenceScoring: true, // Enhanced in v0.6.0+
  validationStrictness: 'balanced', // New in v0.6.0: Advanced validation
  environment: 'production', // Smart defaults with context-aware filtering
  rules: {
    email: { action: 'mask' },
    credit_card: { action: 'mask', preserveLast4: true }
  }
})

const input = 'Email: user@company.com, Card: 4242-4242-4242-4242'
const result = await redactText(registry, input)
// Output: Email: [REDACTED:email], Card: **** **** **** 4242
```

### Tokenization

```typescript
const registry = new Registry({
  defaultAction: 'tokenize',
  hmacKey: 'your-secure-32-character-key-here-123456' // Min 32 chars required
})

const input = 'User: alice@company.com'
const result = await redactText(registry, input)
// Output: User: TKN_EMAIL_AbC123XyZ...

// Same input always produces same token
const sameResult = await redactText(registry, input)
// Tokens will be identical
```

### Advanced Validation (v0.6.0+)

Control false positive detection with context-aware validation:

```typescript
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'balanced' // 'fast' | 'balanced' | 'strict'
})

// Test data is automatically filtered out in balanced/strict modes
const testInput = 'Test email: test@example.com, Real email: john@company.com'
const result = await redactText(registry, testInput)
// Output: Test email: test@example.com, Real email: [REDACTED:email]

// Different strictness levels:
// - 'fast': No validation (maximum performance)
// - 'balanced': Filter test data and weak contexts (recommended)
// - 'strict': Aggressive filtering with context requirements
```

## 🎯 Supported PII Types

**Core Package**:

| Type | Pattern | Example | Notes |
|------|---------|---------|-------|
| `email` | Email addresses | `john@company.com` | ✓ Unicode support, validation |
| `credit_card` | Credit card numbers (Luhn validated) | `4242-4242-4242-4242` | ✓ Brand detection, validation |
| `phone_e164` | International phone numbers | `+1-555-123-4567` | ✓ Format validation |

**Network Detection** (v0.6.0+):

⚠️ **Breaking Change**: Network PII
detection (IPv4/IPv6/MAC) has been moved to a dedicated plugin for better modularity:

```bash
npm install @himorishige/noren-plugin-network
```

```typescript
import { Registry, redactText } from '@himorishige/noren-core'
import * as networkPlugin from '@himorishige/noren-plugin-network'

const registry = new Registry({ defaultAction: 'mask' })
registry.use(networkPlugin.detectors, networkPlugin.maskers)

// Now IPv4, IPv6, and MAC detection works
const result = await redactText(registry, 'Server: 192.168.1.1, MAC: 00:11:22:33:44:55')
// Output: Server: [REDACTED:ipv4], MAC: [REDACTED:mac]
```

## 📊 Stream Processing

For large data processing:

```typescript
import { Registry, createRedactionTransform } from '@himorishige/noren-core'

const registry = new Registry({ defaultAction: 'mask' })
const transform = createRedactionTransform(registry)

// Process any ReadableStream
const inputStream = new ReadableStream({
  start(controller) {
    controller.enqueue('Data with john@company.com...')
    controller.enqueue('More data with 4242-4242-4242-4242...')
    controller.close()
  }
})

const outputStream = inputStream.pipeThrough(transform)

// Collect results
const reader = outputStream.getReader()
const chunks: string[] = []
let done = false

while (!done) {
  const { value, done: readerDone } = await reader.read()
  done = readerDone
  if (value) chunks.push(value)
}

console.log(chunks.join(''))
// Output: Data with [REDACTED:email]...More data with [REDACTED:credit_card]...
```

## 🔧 Advanced Configuration

### Data Types & Object Processing

Noren processes **text strings only**.
Objects and arrays must be converted to strings before processing:

```typescript
import { Registry, redactText } from '@himorishige/noren-core'

const registry = new Registry({ defaultAction: 'mask' })

// ❌ This will fail - objects not supported
const badExample = { email: 'user@example.com' }
// await redactText(registry, badExample) // Error: s.normalize is not a function

// ✅ Convert to JSON string first
const jsonString = JSON.stringify({
  email: 'user@company.com',
  phone: '090-1234-5678'
})
const result = await redactText(registry, jsonString)
// Output: {"email":"[REDACTED:email]","phone":"•••-••••-••••"}

// ✅ Custom object processing helper
async function redactObject(registry, obj, options = {}) {
  if (typeof obj === 'string') {
    return await redactText(registry, obj, options)
  }

  if (Array.isArray(obj)) {
    const results = []
    for (const item of obj) {
      results.push(await redactObject(registry, item, options))
    }
    return results
  }

  if (obj && typeof obj === 'object') {
    const result = {}
    for (const [key, value] of Object.entries(obj)) {
      result[key] = await redactObject(registry, value, options)
    }
    return result
  }

  return obj // numbers, booleans, etc. returned as-is
}

// Process complex nested structures
const complexData = {
  user: {
    email: 'user@company.com',
    phones: ['090-1111-2222', '03-3333-4444']
  },
  messages: ['Contact: admin@company.com', 'Phone: 080-5555-6666']
}

const redacted = await redactObject(registry, complexData, {
  hmacKey: 'your-secure-32-character-key-here-123456'
})
// Output: Nested objects with PII properly masked in string values only
```

### Full-Width Character Support

Noren automatically handles full-width (zenkaku) characters through Unicode NFKC normalization:

```typescript
const registry = new Registry({ defaultAction: 'mask' })

// Full-width characters are automatically normalized before processing
const fullWidthInput = 'Email: user@ｅｘａｍｐｌｅ.ｃｏｍ Phone: 090-1234-５６７８'
const result = await redactText(registry, fullWidthInput)
// Output: Email: [REDACTED:email] Phone: •••-••••-••••

// Detection works the same as half-width equivalents
const halfWidthInput = 'Email: user@company.com Phone: 090-1234-5678'
const sameResult = await redactText(registry, halfWidthInput)
// Both inputs produce equivalent masking results
```

### Environment-Aware Processing

```typescript
const registry = new Registry({
  environment: 'development', // Automatically excludes test patterns
  allowDenyConfig: {
    allowList: ['test@company.com'], // Never treat as PII
    denyList: ['admin@'] // Always treat as PII
  }
})
```

### Performance Tuning

```typescript
const registry = new Registry({
  enableConfidenceScoring: false, // Disable for maximum performance
  sensitivity: 'relaxed' // Less aggressive detection
})
```

## 🌐 Plugin System

Extend functionality with plugins:

```typescript
// Use plugins for extended functionality
import * as networkPlugin from '@himorishige/noren-plugin-network'
import * as jpPlugin from '@himorishige/noren-plugin-jp'
import * as securityPlugin from '@himorishige/noren-plugin-security'

const registry = new Registry({ defaultAction: 'mask' })

// Add network detection (IPv4/IPv6/MAC)
registry.use(networkPlugin.detectors, networkPlugin.maskers)

// Add Japanese PII detection
registry.use(jpPlugin.detectors, jpPlugin.maskers)

// Add security token detection
registry.use(securityPlugin.detectors, securityPlugin.maskers)
```

#### Plugin Validation Integration (v0.6.0+)

Plugins automatically inherit the registry's validation settings:

```typescript
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'balanced' // Applies to plugins too
})

registry.use(jpPlugin.detectors, jpPlugin.maskers)

// Plugin detections are validated using the same rules as core detectors
// (テスト電話 = "test phone", 本番電話 = "production phone")
const text = 'テスト電話: 03-1234-5678, 本番電話: 03-9876-5432'
const result = await redactText(registry, text)
// Only real phone numbers are detected, test patterns are filtered out
```

### Available Plugins

- **[@himorishige/noren-plugin-network](../noren-plugin-network)**: IPv4/IPv6 addresses, MAC addresses **(Required for network detection in v0.6.0+)**
- **[@himorishige/noren-plugin-jp](../noren-plugin-jp)**: Japanese phone numbers, postal codes, My Number
- **[@himorishige/noren-plugin-us](../noren-plugin-us)**: US phone numbers, ZIP codes, SSNs
- **[@himorishige/noren-plugin-security](../noren-plugin-security)**: HTTP headers, API tokens, cookies
- **[@himorishige/noren-dict-reloader](../noren-dict-reloader)**: Dynamic policy reloading

## 📊 JSON/Structured Data Processing

Noren v0.5.0+ includes native support for JSON and NDJSON (newline-delimited JSON) processing:

```typescript
const registry = new Registry({
  defaultAction: 'mask',
  enableJsonDetection: true // Enable structured data processing
})

// JSON object detection
const jsonInput = JSON.stringify({
  user: {
    email: 'admin@company.com',
    phone: '+1-555-123-4567',
    creditCard: '4242-4242-4242-4242'
  }
})

const result = await redactText(registry, jsonInput)
// Detects PII within JSON structure and provides path information

// NDJSON processing
const ndjsonInput = [
  JSON.stringify({ id: 1, email: 'user1@company.com' }),
  JSON.stringify({ id: 2, email: 'user2@company.com' })
].join('\n')

const ndjsonResult = await redactText(registry, ndjsonInput)
// Processes each JSON line independently
```

### JSON Detection Features

- **Key-based detection**: Enhanced accuracy using JSON key names as context
- **Path tracking**: Provides full JSON path for detected PII (e.g., `$.user.email`)
- **Nested objects**: Recursive detection in deeply nested structures
- **NDJSON support**: Line-by-line processing for streaming data
- **Type safety**: Validates JSON structure before processing

## 🔗 MCP (Model Context Protocol) Integration

Noren provides specialized support for MCP servers that communicate via JSON-RPC over stdio. This is particularly useful for AI tools like Claude Code that need to process communication with external services while protecting sensitive data.

### MCP Transform Stream

For real-time stdio processing in MCP servers:

```typescript
import { Readable, Writable } from 'node:stream'
import {
  Registry,
  createMCPRedactionTransform,
  redactJsonRpcMessage
} from '@himorishige/noren-core'

// Create registry with comprehensive PII detection
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'fast', // Optimized for real-time processing
  enableJsonDetection: true,
  rules: {
    email: { action: 'mask' },
    api_key: { action: 'remove' },
    jwt_token: { action: 'tokenize' }
  },
  hmacKey: 'mcp-server-redaction-key-32-chars-minimum-length-required'
})

// Create MCP-optimized transform stream
const transform = createMCPRedactionTransform({
  registry,
  policy: { defaultAction: 'mask' },
  lineBufferSize: 64 * 1024
})

// Process stdio communication.
// process.stdin/stdout are Node.js streams, so convert them to Web Streams
// first (as the proxy example below does).
await Readable.toWeb(process.stdin)
  .pipeThrough(transform)
  .pipeTo(Writable.toWeb(process.stdout))
```

### JSON-RPC Message Processing

For processing individual JSON-RPC messages:

```typescript
// Process a JSON-RPC request
const request = {
  jsonrpc: '2.0',
  method: 'getUserProfile',
  params: {
    email: 'user@company.com',
    phone: '+1-555-123-4567'
  },
  id: 1
}

const redacted = await redactJsonRpcMessage(request, { registry })

console.log(redacted)
// Output: {
//   jsonrpc: '2.0',
//   method: 'getUserProfile',
//   params: {
//     email: '[REDACTED:email]',
//     phone: '•••-•••-••••'
//   },
//   id: 1
// }
```

### MCP Server Proxy Example

Create a proxy server that automatically redacts PII from stdio communication:

```javascript
#!/usr/bin/env node
import { Registry, createMCPRedactionTransform } from '@himorishige/noren-core'
import { Readable, Writable } from 'node:stream'

class MCPRedactionProxy {
  constructor(options = {}) {
    this.registry = new Registry({
      defaultAction: 'mask',
      enableJsonDetection: true,
      validationStrictness: 'fast'
    })
  }

  async start() {
    const inputStream = Readable.toWeb(process.stdin)
    const outputStream = Writable.toWeb(process.stdout)

    const transform = createMCPRedactionTransform({
      registry: this.registry,
      policy: { defaultAction: 'mask' }
    })

    await inputStream
      .pipeThrough(transform)
      .pipeTo(outputStream)
  }
}

// Start the proxy
const proxy = new MCPRedactionProxy()
await proxy.start()
```

### MCP Use Cases

**1. AI Assistant Communication**

- Protect user data in Claude Code AI interactions
- Redact PII from external API communications
- Safe logging of AI model conversations

**2. Development Tools Integration**

- IDE extensions with PII protection
- Code analysis tools with privacy features
- Debug logging with automatic data sanitization

**3.
CI/CD Pipeline Protection**

- Build logs with PII redaction
- Test data anonymization
- Environment variable protection

### MCP Utilities

The library also provides utility functions for MCP processing:

```typescript
import {
  parseJsonLines,
  isValidJsonRpcMessage,
  extractSensitiveContent,
  containsJsonRpcPattern,
  getMessageType
} from '@himorishige/noren-core'

// Parse line-delimited JSON messages
const messages = parseJsonLines(ndjsonString)

// Validate JSON-RPC message format
if (isValidJsonRpcMessage(message)) {
  const type = getMessageType(message)
  // 'request' | 'response' | 'notification' | 'error'
}

// Extract potentially sensitive content
const sensitiveContent = extractSensitiveContent(jsonRpcMessage)
```

## 📚 API Reference

### `Registry`

Main class for PII detection and configuration.

#### Constructor Options

```typescript
interface RegistryOptions {
  defaultAction?: 'mask' | 'remove' | 'tokenize'
  rules?: Record<string, { action: Action, preserveLast4?: boolean }>
  hmacKey?: string // Required for tokenization
  environment?: 'production' | 'development' | 'test'
  allowDenyConfig?: AllowDenyConfig
  enableConfidenceScoring?: boolean
  enableJsonDetection?: boolean // New: Enable JSON/NDJSON processing
  sensitivity?: 'strict' | 'balanced' | 'relaxed'
  contextHints?: string[] // Keywords to improve detection
  validationStrictness?: 'fast' | 'balanced' | 'strict' // v0.6.0+: Context validation level
}
```

#### Methods

- `use(detectors, maskers, contextHints?)`: Add plugins
- `detect(text, contextHints?)`: Detect PII (returns hits)
- `maskerFor(type)`: Get masker for PII type

### `redactText(registry, input, overrides?)`

Process text and apply redaction rules.

### `createRedactionTransform(registry, overrides?)`

Create transform stream for large data processing.
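
Any streaming redactor has to cope with PII that straddles a chunk boundary. This README does not describe how `createRedactionTransform` handles that internally, so the following is only a minimal sketch of the concern. It assumes a stand-in email regex and a hypothetical `ChunkRedactor` class, neither of which is part of Noren's API. The class holds back the trailing partial word of each chunk until more input arrives.

```typescript
// Stand-in detector: a simple email regex, NOT Noren's detection logic.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g

function redactEmails(text: string): string {
  return text.replace(EMAIL, '[REDACTED:email]')
}

// Hypothetical helper: redacts chunk by chunk, but only emits text up to the
// last whitespace seen so far, keeping the unfinished tail for the next chunk.
class ChunkRedactor {
  private pending = ''

  push(chunk: string): string {
    const buffer = this.pending + chunk
    // Walk back from the end to just after the last whitespace character.
    let cut = buffer.length
    while (cut > 0 && !/\s/.test(buffer.charAt(cut - 1))) cut--
    this.pending = buffer.slice(cut)
    return redactEmails(buffer.slice(0, cut))
  }

  // Call once after the final chunk to emit whatever is still held back.
  flush(): string {
    const rest = this.pending
    this.pending = ''
    return redactEmails(rest)
  }
}

// An email address split across two chunks.
const chunks = ['Contact: john@comp', 'any.com for details']

const redactor = new ChunkRedactor()
const streamed = chunks.map((c) => redactor.push(c)).join('') + redactor.flush()
console.log(streamed) // Contact: [REDACTED:email] for details

// Redacting each chunk in isolation misses the split address.
const naive = chunks.map((c) => redactEmails(c)).join('')
console.log(naive) // Contact: john@company.com for details
```

The same `push`/`flush` pair maps onto the `transform` and `flush` callbacks of a Web Streams `TransformStream`. Splitting on whitespace is enough for this stand-in pattern only; values that contain spaces (for example card numbers written as `4242 4242 4242 4242`) would need a longer hold-back window, and a production version would also cap the size of the pending buffer.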
## ⚡ Performance

### Benchmarks (v0.5.0)

- **Bundle Size**: 124KB optimized distribution
- **Processing Speed**: 102,229 operations/second (0.0098ms per iteration)
- **Memory Efficiency**: Object pooling with automatic cleanup
- **TypeScript Codebase**: 1,782 lines (40%+ reduction from v0.4.x)
- **API Surface**: 14 exports (65% reduction for better tree-shaking)

### Best Practices

1. **Reuse Registry instances** - avoid creating new ones frequently
2. **Use streams** for large data processing
3. **Disable confidence scoring** for maximum performance
4. **Pre-compile patterns** by loading plugins at startup

## 🔒 Security Considerations

### HMAC Keys

- **Minimum 32 characters** required (enforced in v0.5.0)
- Store in environment variables, never in code
- Use different keys per environment
- Rotate keys regularly
- Base64URL token format for better security

### Memory Safety

- Automatic object pooling reduces GC pressure
- Sensitive data is cleared from memory after processing
- Configurable limits prevent DoS attacks

## 🛠 Development Tools

For advanced features like benchmarking and A/B testing:

```bash
npm install @himorishige/noren-devtools
```

See [@himorishige/noren-devtools](../noren-devtools) for development and testing tools.
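
Returning to the HMAC key guidance under Security Considerations: the sketch below shows one way the pieces (a 32-character minimum, a key read from the environment, deterministic Base64URL tokens) could fit together. It is not Noren's implementation. `loadHmacKey`, `tokenize`, and the `NOREN_HMAC_KEY` variable name are hypothetical, and Node's built-in `createHmac` is used for brevity where a Web Standards runtime would use `crypto.subtle.sign`.

```typescript
import { createHmac } from 'node:crypto'

// Mirrors the 32-character minimum stated in this README.
const MIN_KEY_LENGTH = 32

// Hypothetical helper: read the key from an environment-like object and
// fail fast when it is missing or too short.
function loadHmacKey(
  env: Record<string, string | undefined>,
  name = 'NOREN_HMAC_KEY' // assumed variable name, not defined by Noren
): string {
  const key = env[name]
  if (!key || key.length < MIN_KEY_LENGTH) {
    throw new Error(`${name} must be set to at least ${MIN_KEY_LENGTH} characters`)
  }
  return key
}

// Hypothetical helper: derive a deterministic token for one value.
// HMAC-SHA256 is keyed, so the token cannot be recomputed without the key.
function tokenize(type: string, value: string, key: string): string {
  const digest = createHmac('sha256', key)
    .update(`${type}:${value}`)
    .digest('base64url') // URL-safe alphabet, no padding
  // Shape loosely modelled on the TKN_EMAIL_... example above;
  // the 16-character truncation is an arbitrary choice for this sketch.
  return `TKN_${type.toUpperCase()}_${digest.slice(0, 16)}`
}

// Usage: pass process.env in a real service; a literal stands in here.
const demoEnv = { NOREN_HMAC_KEY: 'demo-only-key-0123456789-abcdefghij' }
const demoKey = loadHmacKey(demoEnv)

const first = tokenize('email', 'alice@company.com', demoKey)
const second = tokenize('email', 'alice@company.com', demoKey)
console.log(first === second) // true
```

Because the token depends on the key, rotating the key or using a different key per environment changes every token, so tokens produced under different keys cannot be joined against each other.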
## 🔄 Version History

### v0.6.0 (Latest) - Advanced Validation & Architecture Optimization

**🚨 Breaking Changes:**

- **Network detection separation**: IPv4/IPv6/MAC detection moved to `@himorishige/noren-plugin-network`
- **Smaller core bundle**: 35% reduction in core package size by removing network patterns
- **Plugin-based architecture**: Better modularity and optional feature loading

**🛡️ New Features:**

- **Advanced validation system**: Context-aware false positive filtering with 3 strictness levels (`fast`/`balanced`/`strict`)
- **Plugin validation integration**: Automatic validation for plugin-detected PII types with seamless inheritance
- **🇯🇵 Enhanced Japanese language support**: Specialized validators and expanded context keywords for improved accuracy
- **📋 Debug utilities**: New `debugValidation()` function for detailed validation analysis
- **⚡ Performance optimized**: Validation adds minimal overhead while significantly reducing false positives
- **🎯 Context-aware filtering**: Smart detection of test data, examples, and weak contexts
- **🔄 Backward compatible**: All existing APIs work without changes (except network detection)

**📦 Migration Guide:**

```typescript
// Before v0.6.0 (network detection included in core)
const before = await redactText(registry, 'IP: 192.168.1.1')

// v0.6.0+: install the network plugin first
//   npm install @himorishige/noren-plugin-network
import * as networkPlugin from '@himorishige/noren-plugin-network'

registry.use(networkPlugin.detectors, networkPlugin.maskers)
const after = await redactText(registry, 'IP: 192.168.1.1')
```

### v0.5.0 - Performance & Structured Data Support

- **JSON/NDJSON detection**: Native support for structured data with key-based matching
- **Prefilter optimization**: Fast screening reduces processing time for non-PII text
- **77% code reduction**: Streamlined from 8,153 to 1,782 lines
- **Single-pass detection**: Unified pattern matching for better performance
- **Optimized IPv6 parser**: 31% size reduction with enhanced validation
- **Streamlined Hit Pool**: 47% size reduction with object pooling
- **Reduced API surface**: 65% fewer exports for better tree-shaking
- **Enhanced security**: Stricter boundaries and improved validation
- **Code quality improvements**: Full TypeScript strict mode compliance

### v0.4.0 - Confidence Scoring & Advanced Features

- Added confidence scoring system
- Environment-aware processing
- Enhanced HMAC security with 32-character minimum
- Development tools package separation

## 📄 License

MIT License - see [LICENSE](../../LICENSE) for details.

---

**Part of the [Noren](../../README.md) PII protection suite**