# @himorishige/noren-core
**Fast, lightweight PII detection and masking library built on Web Standards**
The core library of the Noren PII protection suite - designed for **simplicity**, **performance**, and **universal compatibility**.
## Key Features
- **Ultra-lightweight**: 124KB bundled size (77% code reduction)
- **High performance**: 102K+ ops/sec with pre-compiled patterns
- **Web Standards**: Works everywhere (Node.js, Edge, Browsers)
- **Smart detection**: Built-in patterns with confidence scoring
- **Advanced validation**: Context-aware false positive filtering with 3 strictness levels
- **JSON/NDJSON Support**: Native structured data detection with key-based matching
- **Prefilter optimization**: Fast screening before expensive regex operations
- **Enhanced security**: HMAC-based tokenization with 32-char minimum key
- **Zero dependencies**: Pure JavaScript, no external deps
- **Confidence scoring**: Rule-based detection accuracy control
## Installation
```bash
npm install @himorishige/noren-core
```
## Quick Start
### Basic Usage
```typescript
import { Registry, redactText } from '@himorishige/noren-core'
// Create registry with default settings
const registry = new Registry({
  defaultAction: 'mask'
})
// Detect and mask PII
const input = 'Contact: john@company.com, Card: 4242-4242-4242-4242'
const result = await redactText(registry, input)
console.log(result)
// Output: Contact: [REDACTED:email], Card: [REDACTED:credit_card]
```
### With Custom Rules
```typescript
const registry = new Registry({
  defaultAction: 'mask',
  enableConfidenceScoring: true, // Enhanced in v0.6.0+
  validationStrictness: 'balanced', // New in v0.6.0: Advanced validation
  environment: 'production', // Smart defaults with context-aware filtering
  rules: {
    email: { action: 'mask' },
    credit_card: { action: 'mask', preserveLast4: true }
  }
})
const input = 'Email: user@company.com, Card: 4242-4242-4242-4242'
const result = await redactText(registry, input)
// Output: Email: [REDACTED:email], Card: **** **** **** 4242
```
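The `preserveLast4` output shown above can be pictured with a small standalone sketch. This is not the library's masker: `maskCardKeepLast4` is a hypothetical helper written only for illustration, and the grouping into blocks of four is an assumption based on the sample output.

```typescript
// Hypothetical helper, not part of @himorishige/noren-core.
function maskCardKeepLast4(card: string): string {
  // Drop separators so only the digits remain
  const digits = card.replace(/\D/g, '')
  // Too short to keep anything visible: mask everything
  if (digits.length <= 4) return '*'.repeat(digits.length)
  const masked = '*'.repeat(digits.length - 4) + digits.slice(-4)
  // Re-group into blocks of four for readability
  return (masked.match(/.{1,4}/g) ?? []).join(' ')
}

console.log(maskCardKeepLast4('4242-4242-4242-4242'))
// **** **** **** 4242
```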
### Tokenization
```typescript
const registry = new Registry({
  defaultAction: 'tokenize',
  hmacKey: 'your-secure-32-character-key-here-123456' // Min 32 chars required
})
const input = 'User: alice@company.com'
const result = await redactText(registry, input)
// Output: User: TKN_EMAIL_AbC123XyZ...
// Same input always produces same token
const sameResult = await redactText(registry, input)
// Tokens will be identical
```
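To see why the same input always yields the same token, here is a minimal sketch of HMAC-based tokenization using the Web Crypto API. It is not the library's tokenizer: the function names, the SHA-256 hash, and the 16-character truncation are assumptions made for illustration; only the `TKN_EMAIL_` prefix and the 32-character minimum come from this README.

```typescript
// Illustrative only: not the implementation used by @himorishige/noren-core.
function toBase64Url(bytes: Uint8Array): string {
  let binary = ''
  bytes.forEach((b) => {
    binary += String.fromCharCode(b)
  })
  // Base64URL: swap '+' and '/' for URL-safe characters and drop padding
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '')
}

async function hmacToken(type: string, value: string, key: string): Promise<string> {
  if (key.length < 32) throw new Error('HMAC key must be at least 32 characters')
  const encoder = new TextEncoder()
  const cryptoKey = await crypto.subtle.importKey(
    'raw',
    encoder.encode(key),
    { name: 'HMAC', hash: 'SHA-256' }, // hash choice is an assumption
    false,
    ['sign']
  )
  const signature = await crypto.subtle.sign('HMAC', cryptoKey, encoder.encode(value))
  // Truncating to 16 characters is an arbitrary choice for this sketch
  return `TKN_${type.toUpperCase()}_${toBase64Url(new Uint8Array(signature)).slice(0, 16)}`
}
```

Because an HMAC is a deterministic function of the key and the message, two calls with the same key and value return identical tokens, while the original value cannot be recovered without the key.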
### Advanced Validation (v0.6.0+)
Control false positive detection with context-aware validation:
```typescript
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'balanced' // 'fast' | 'balanced' | 'strict'
})
// Test data is automatically filtered out in balanced/strict modes
const testInput = 'Test email: test@example.com, Real email: john@company.com'
const result = await redactText(registry, testInput)
// Output: Test email: test@example.com, Real email: [REDACTED:email]
// Different strictness levels:
// - 'fast': No validation (maximum performance)
// - 'balanced': Filter test data and weak contexts (recommended)
// - 'strict': Aggressive filtering with context requirements
```
## Supported PII Types
**Core Package**:
| Type | Pattern | Example | Notes |
|------|---------|---------|-------|
| `email` | Email addresses | `john@company.com` | Unicode support, validation |
| `credit_card` | Credit card numbers (Luhn validated) | `4242-4242-4242-4242` | Brand detection, validation |
| `phone_e164` | International phone numbers | `+1-555-123-4567` | Format validation |
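
The table notes that card numbers are Luhn validated. As a rough illustration of what that check involves, here is a standalone sketch of the Luhn algorithm. `passesLuhn` is a hypothetical name, and the accepted length range of 12 to 19 digits is an assumption, not something this README specifies.

```typescript
// Standalone Luhn check; not the validator shipped with the library.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, '')
  if (!/^\d{12,19}$/.test(digits)) return false
  let sum = 0
  let doubleNext = false
  // Walk from the rightmost digit, doubling every second digit
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48
    if (doubleNext) {
      d *= 2
      if (d > 9) d -= 9
    }
    sum += d
    doubleNext = !doubleNext
  }
  return sum % 10 === 0
}

console.log(passesLuhn('4242-4242-4242-4242')) // true
console.log(passesLuhn('4242-4242-4242-4241')) // false
```

A checksum like this is what lets a detector reject most random 16-digit strings that merely look like card numbers.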
**Network Detection** (v0.6.0+):
⚠️ **Breaking Change**: Network PII detection (IPv4/IPv6/MAC) has been moved to a dedicated plugin for better modularity:
```bash
npm install @himorishige/noren-plugin-network
```
```typescript
import { Registry, redactText } from '@himorishige/noren-core'
import * as networkPlugin from '@himorishige/noren-plugin-network'
const registry = new Registry({ defaultAction: 'mask' })
registry.use(networkPlugin.detectors, networkPlugin.maskers)
// Now IPv4, IPv6, and MAC detection works
const result = await redactText(registry, 'Server: 192.168.1.1, MAC: 00:11:22:33:44:55')
// Output: Server: [REDACTED:ipv4], MAC: [REDACTED:mac]
```
## Stream Processing
For large data processing:
```typescript
import { Registry, createRedactionTransform } from '@himorishige/noren-core'

const registry = new Registry({ defaultAction: 'mask' })
const transform = createRedactionTransform(registry)

// Process any ReadableStream
const inputStream = new ReadableStream({
  start(controller) {
    controller.enqueue('Data with john@company.com...')
    controller.enqueue('More data with 4242-4242-4242-4242...')
    controller.close()
  }
})
const outputStream = inputStream.pipeThrough(transform)

// Collect results
const reader = outputStream.getReader()
const chunks = []
let done = false
while (!done) {
  const { value, done: readerDone } = await reader.read()
  done = readerDone
  if (value) chunks.push(value)
}
console.log(chunks.join(''))
// Output: Data with [REDACTED:email]...More data with [REDACTED:credit_card]...
```
## Advanced Configuration
### Data Types & Object Processing
Noren processes **text strings only**. Objects and arrays must be converted to strings before processing:
```typescript
import { Registry, redactText } from '@himorishige/noren-core'
const registry = new Registry({ defaultAction: 'mask' })
// ❌ This will fail - objects not supported
const badExample = { email: 'user@example.com' }
// await redactText(registry, badExample) // Error: s.normalize is not a function

// ✅ Convert to JSON string first
const jsonString = JSON.stringify({ email: 'user@company.com', phone: '090-1234-5678' })
const result = await redactText(registry, jsonString)
// Output: {"email":"[REDACTED:email]","phone":"•••-••••-••••"}

// ✅ Custom object processing helper
async function redactObject(registry, obj, options = {}) {
  if (typeof obj === 'string') {
    return await redactText(registry, obj, options)
  }
  if (Array.isArray(obj)) {
    const results = []
    for (const item of obj) {
      results.push(await redactObject(registry, item, options))
    }
    return results
  }
  if (obj && typeof obj === 'object') {
    const result = {}
    for (const [key, value] of Object.entries(obj)) {
      result[key] = await redactObject(registry, value, options)
    }
    return result
  }
  return obj // numbers, booleans, etc. returned as-is
}
// Process complex nested structures
const complexData = {
  user: { email: 'user@company.com', phones: ['090-1111-2222', '03-3333-4444'] },
  messages: ['Contact: admin@company.com', 'Phone: 080-5555-6666']
}
const redacted = await redactObject(registry, complexData, {
  hmacKey: 'your-secure-32-character-key-here-123456'
})
// Output: Nested objects with PII properly masked in string values only
```
### Full-Width Character Support
Noren automatically handles full-width (zenkaku) characters through Unicode NFKC normalization:
```typescript
const registry = new Registry({ defaultAction: 'mask' })
// Full-width characters are automatically normalized before processing
const fullWidthInput = 'Email: ｕｓｅｒ@ｅｘａｍｐｌｅ.ｃｏｍ Phone: ０９０-１２３４-５６７８'
const result = await redactText(registry, fullWidthInput)
// Output: Email: [REDACTED:email] Phone: •••-••••-••••
// Detection works the same as half-width equivalents
const halfWidthInput = 'Email: user@company.com Phone: 090-1234-5678'
const sameResult = await redactText(registry, halfWidthInput)
// Both inputs produce equivalent masking results
```
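The normalization step itself is plain JavaScript and can be tried without the library. The snippet below only demonstrates the built-in `String.prototype.normalize('NFKC')`; how Noren maps detected positions back to the original text is not covered here.

```typescript
// NFKC folds full-width (zenkaku) letters, digits, and symbols to ASCII
const fullWidth = 'ｕｓｅｒ＠ｅｘａｍｐｌｅ．ｃｏｍ ０９０－１２３４－５６７８'
const normalized = fullWidth.normalize('NFKC')
console.log(normalized) // user@example.com 090-1234-5678

// NFKC can also change string length: half-width katakana "ｶ" plus the
// voiced mark "ﾞ" (two code units) composes into the single character "ガ"
const halfWidthKana = 'ｶﾞ'
const composedKana = halfWidthKana.normalize('NFKC')
console.log(halfWidthKana.length, composedKana.length) // 2 1
```

The length change in the second case is worth keeping in mind: offsets computed on normalized text do not always line up with offsets in the raw input.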
### Environment-Aware Processing
```typescript
const registry = new Registry({
  environment: 'development', // Automatically excludes test patterns
  allowDenyConfig: {
    allowList: ['test@company.com'], // Never treat as PII
    denyList: ['admin@'] // Always treat as PII
  }
})
```
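One way to think about allow and deny lists is as an override that runs before normal detection. The sketch below is a guess at such semantics, not the library's behavior: exact matching for `allowList`, prefix matching for `denyList`, case-insensitive comparison, and the allow list winning on conflict are all assumptions, and `decide` is a hypothetical function.

```typescript
// Hypothetical model of allow/deny overrides; the real matching rules may differ.
type ListDecision = 'allow' | 'deny' | 'default'

interface AllowDenySketch {
  allowList: string[]
  denyList: string[]
}

function decide(value: string, config: AllowDenySketch): ListDecision {
  const v = value.toLowerCase()
  // Assumption: allow entries must match the whole value
  if (config.allowList.some((entry) => v === entry.toLowerCase())) return 'allow'
  // Assumption: deny entries match as prefixes, as 'admin@' suggests
  if (config.denyList.some((entry) => v.startsWith(entry.toLowerCase()))) return 'deny'
  // Otherwise fall through to normal detection and validation
  return 'default'
}

const lists: AllowDenySketch = { allowList: ['test@company.com'], denyList: ['admin@'] }
console.log(decide('test@company.com', lists)) // allow
console.log(decide('admin@company.com', lists)) // deny
console.log(decide('john@company.com', lists)) // default
```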
### Performance Tuning
```typescript
const registry = new Registry({
  enableConfidenceScoring: false, // Disable for maximum performance
  sensitivity: 'relaxed' // Less aggressive detection
})
```
## Plugin System
Extend functionality with plugins:
```typescript
// Use plugins for extended functionality
import { Registry } from '@himorishige/noren-core'
import * as networkPlugin from '@himorishige/noren-plugin-network'
import * as jpPlugin from '@himorishige/noren-plugin-jp'
import * as securityPlugin from '@himorishige/noren-plugin-security'
const registry = new Registry({ defaultAction: 'mask' })
// Add network detection (IPv4/IPv6/MAC)
registry.use(networkPlugin.detectors, networkPlugin.maskers)
// Add Japanese PII detection
registry.use(jpPlugin.detectors, jpPlugin.maskers)
// Add security token detection
registry.use(securityPlugin.detectors, securityPlugin.maskers)
```
### Plugin Validation Integration (v0.6.0+)
Plugins automatically inherit the registry's validation settings:
```typescript
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'balanced' // Applies to plugins too
})
registry.use(jpPlugin.detectors, jpPlugin.maskers)
// Plugin detections are validated using the same rules as core detectors
// テスト電話 = "test phone", 本番電話 = "production phone"
const text = 'テスト電話: 03-1234-5678, 本番電話: 03-9876-5432'
const result = await redactText(registry, text)
// Only real phone numbers are detected, test patterns are filtered out
```
### Available Plugins
- **[@himorishige/noren-plugin-network](../noren-plugin-network)**: IPv4/IPv6 addresses, MAC addresses **(Required for network detection in v0.6.0+)**
- **[@himorishige/noren-plugin-jp](../noren-plugin-jp)**: Japanese phone numbers, postal codes, My Number
- **[@himorishige/noren-plugin-us](../noren-plugin-us)**: US phone numbers, ZIP codes, SSNs
- **[@himorishige/noren-plugin-security](../noren-plugin-security)**: HTTP headers, API tokens, cookies
- **[@himorishige/noren-dict-reloader](../noren-dict-reloader)**: Dynamic policy reloading
## JSON/Structured Data Processing
Noren v0.5.0+ includes native support for JSON and NDJSON (newline-delimited JSON) processing:
```typescript
const registry = new Registry({
  defaultAction: 'mask',
  enableJsonDetection: true // Enable structured data processing
})

// JSON object detection
const jsonInput = JSON.stringify({
  user: {
    email: 'admin@company.com',
    phone: '+1-555-123-4567',
    creditCard: '4242-4242-4242-4242'
  }
})
const result = await redactText(registry, jsonInput)
// Detects PII within JSON structure and provides path information
// NDJSON processing
const ndjsonInput = [
  JSON.stringify({ id: 1, email: 'user1@company.com' }),
  JSON.stringify({ id: 2, email: 'user2@company.com' })
].join('\n')
const ndjsonResult = await redactText(registry, ndjsonInput)
// Processes each JSON line independently
```
### JSON Detection Features
- **Key-based detection**: Enhanced accuracy using JSON key names as context
- **Path tracking**: Provides full JSON path for detected PII (e.g., `$.user.email`)
- **Nested objects**: Recursive detection in deeply nested structures
- **NDJSON support**: Line-by-line processing for streaming data
- **Type safety**: Validates JSON structure before processing
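
To make the path-tracking idea concrete, here is a small sketch that walks a parsed JSON value and records a path for every string leaf. It is only an illustration of the concept: `collectStringPaths` is a hypothetical function, and the `[index]` notation for arrays is an assumption borrowed from JSONPath conventions rather than something this README documents.

```typescript
// Illustrative walker; not the detector used by the library.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

interface StringLeaf {
  path: string
  value: string
}

function collectStringPaths(value: Json, path = '$', out: StringLeaf[] = []): StringLeaf[] {
  if (typeof value === 'string') {
    // Only string leaves can contain textual PII
    out.push({ path, value })
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => collectStringPaths(item, `${path}[${i}]`, out))
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      collectStringPaths(child, `${path}.${key}`, out)
    }
  }
  return out
}

const leaves = collectStringPaths({
  user: { email: 'admin@company.com', phones: ['+1-555-123-4567'] },
  id: 1
})
console.log(leaves.map((leaf) => leaf.path))
// [ '$.user.email', '$.user.phones[0]' ]
```

With a path such as `$.user.email` in hand, a detector can use the key name (`email`) as extra context when deciding whether a value is PII.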
## MCP (Model Context Protocol) Integration
Noren provides specialized support for MCP servers that communicate via JSON-RPC over stdio. This is particularly useful for AI tools like Claude Code that need to process communication with external services while protecting sensitive data.
### MCP Transform Stream
For real-time stdio processing in MCP servers:
```typescript
import {
  Registry,
  createMCPRedactionTransform,
  redactJsonRpcMessage
} from '@himorishige/noren-core'

// Create registry with comprehensive PII detection
const registry = new Registry({
  defaultAction: 'mask',
  validationStrictness: 'fast', // Optimized for real-time processing
  enableJsonDetection: true,
  rules: {
    email: { action: 'mask' },
    api_key: { action: 'remove' },
    jwt_token: { action: 'tokenize' }
  },
  hmacKey: 'mcp-server-redaction-key-32-chars-minimum-length-required'
})

// Create MCP-optimized transform stream
const transform = createMCPRedactionTransform({
  registry,
  policy: { defaultAction: 'mask' },
  lineBufferSize: 64 * 1024
})
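// Process stdio communication.
// process.stdin/stdout are Node.js streams without pipeThrough/pipeTo,
// so convert them to Web Streams first.
import { Readable, Writable } from 'node:stream'

await Readable.toWeb(process.stdin)
  .pipeThrough(transform)
  .pipeTo(Writable.toWeb(process.stdout))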
```
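Stdio delivers data in arbitrary chunks, so a JSON-RPC message can be split across two reads. The `lineBufferSize` option above suggests the transform buffers partial lines; the sketch below shows that general technique in isolation. `LineBuffer` is a hypothetical class, and throwing when a line exceeds the limit is an assumption, not documented library behavior.

```typescript
// Generic line buffering for newline-delimited messages; illustrative only.
class LineBuffer {
  private pending = ''
  private readonly maxSize: number

  constructor(maxSize = 64 * 1024) {
    this.maxSize = maxSize
  }

  // Accept a chunk and return every complete, non-empty line seen so far
  push(chunk: string): string[] {
    this.pending += chunk
    const parts = this.pending.split('\n')
    // The last element is an incomplete line (or '' if the chunk ended in '\n')
    this.pending = parts.pop() ?? ''
    if (this.pending.length > this.maxSize) {
      throw new Error('Line exceeds buffer size')
    }
    return parts.filter((line) => line.trim() !== '')
  }

  // Return whatever is left once the input has ended
  flush(): string[] {
    const rest = this.pending
    this.pending = ''
    return rest.trim() === '' ? [] : [rest]
  }
}

const buffer = new LineBuffer()
console.log(buffer.push('{"jsonrpc":"2.0","id":1}\n{"jsonrpc":'))
// [ '{"jsonrpc":"2.0","id":1}' ]
console.log(buffer.push('"2.0","id":2}\n'))
// [ '{"jsonrpc":"2.0","id":2}' ]
```

Redacting only complete lines keeps each message parseable as JSON and avoids missing PII that straddles a chunk boundary.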
### JSON-RPC Message Processing
For processing individual JSON-RPC messages:
```typescript
// Process a JSON-RPC request
const request = {
  jsonrpc: '2.0',
  method: 'getUserProfile',
  params: {
    email: 'user@company.com',
    phone: '+1-555-123-4567'
  },
  id: 1
}
const redacted = await redactJsonRpcMessage(request, { registry })
console.log(redacted)
// Output: {
//   jsonrpc: '2.0',
//   method: 'getUserProfile',
//   params: {
//     email: '[REDACTED:email]',
//     phone: '•••-•••-••••'
//   },
//   id: 1
// }
```
### MCP Server Proxy Example
Create a proxy server that automatically redacts PII from stdio communication:
```javascript
#!/usr/bin/env node
import { Registry, createMCPRedactionTransform } from '@himorishige/noren-core'
import { Readable, Writable } from 'node:stream'
class MCPRedactionProxy {
  constructor(options = {}) {
    this.registry = new Registry({
      defaultAction: 'mask',
      enableJsonDetection: true,
      validationStrictness: 'fast'
    })
  }

  async start() {
    const inputStream = Readable.toWeb(process.stdin)
    const outputStream = Writable.toWeb(process.stdout)
    const transform = createMCPRedactionTransform({
      registry: this.registry,
      policy: { defaultAction: 'mask' }
    })
    await inputStream
      .pipeThrough(transform)
      .pipeTo(outputStream)
  }
}
// Start the proxy
const proxy = new MCPRedactionProxy()
await proxy.start()
```
### MCP Use Cases
**1. AI Assistant Communication**
- Protect user data in Claude Code AI interactions
- Redact PII from external API communications
- Safe logging of AI model conversations
**2. Development Tools Integration**
- IDE extensions with PII protection
- Code analysis tools with privacy features
- Debug logging with automatic data sanitization
**3. CI/CD Pipeline Protection**
- Build logs with PII redaction
- Test data anonymization
- Environment variable protection
### MCP Utilities
The library also provides utility functions for MCP processing:
```typescript
import {
  parseJsonLines,
  isValidJsonRpcMessage,
  extractSensitiveContent,
  containsJsonRpcPattern,
  getMessageType
} from '@himorishige/noren-core'
// Parse line-delimited JSON messages
const messages = parseJsonLines(ndjsonString)
// Validate JSON-RPC message format
if (isValidJsonRpcMessage(message)) {
  const type = getMessageType(message) // 'request' | 'response' | 'notification' | 'error'
}
// Extract potentially sensitive content
const sensitiveContent = extractSensitiveContent(jsonRpcMessage)
```
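For readers unfamiliar with the four message types, the distinction follows from the JSON-RPC 2.0 specification: a request carries `method` and `id`, a notification carries `method` without `id`, and a response carries either `result` or `error`. The sketch below encodes those rules in a hypothetical `classifyJsonRpc` function; it is not the library's `getMessageType`, whose exact behavior is not shown in this README.

```typescript
// Classification based on the JSON-RPC 2.0 spec; illustrative only.
type JsonRpcKind = 'request' | 'notification' | 'response' | 'error' | 'invalid'

function classifyJsonRpc(message: unknown): JsonRpcKind {
  if (typeof message !== 'object' || message === null) return 'invalid'
  const m = message as Record<string, unknown>
  if (m.jsonrpc !== '2.0') return 'invalid'
  // A method with an id expects a reply; without an id it is a notification
  if (typeof m.method === 'string') return 'id' in m ? 'request' : 'notification'
  if ('error' in m) return 'error'
  if ('result' in m) return 'response'
  return 'invalid'
}

console.log(classifyJsonRpc({ jsonrpc: '2.0', method: 'getUserProfile', id: 1 })) // request
console.log(classifyJsonRpc({ jsonrpc: '2.0', method: 'progress' })) // notification
console.log(classifyJsonRpc({ jsonrpc: '2.0', result: { ok: true }, id: 1 })) // response
console.log(classifyJsonRpc({ jsonrpc: '2.0', error: { code: -32601, message: 'Method not found' }, id: 1 })) // error
```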
## API Reference
### `Registry`
Main class for PII detection and configuration.
#### Constructor Options
```typescript
interface RegistryOptions {
  defaultAction?: 'mask' | 'remove' | 'tokenize'
  rules?: Record<string, { action: Action, preserveLast4?: boolean }>
  hmacKey?: string // Required for tokenization
  environment?: 'production' | 'development' | 'test'
  allowDenyConfig?: AllowDenyConfig
  enableConfidenceScoring?: boolean
  enableJsonDetection?: boolean // New: Enable JSON/NDJSON processing
  sensitivity?: 'strict' | 'balanced' | 'relaxed'
  contextHints?: string[] // Keywords to improve detection
  validationStrictness?: 'fast' | 'balanced' | 'strict' // v0.6.0+: Context validation level
}
```
#### Methods
- `use(detectors, maskers, contextHints?)`: Add plugins
- `detect(text, contextHints?)`: Detect PII (returns hits)
- `maskerFor(type)`: Get masker for PII type
### `redactText(registry, input, overrides?)`
Process text and apply redaction rules.
### `createRedactionTransform(registry, overrides?)`
Create transform stream for large data processing.
## Performance
### Benchmarks (v0.5.0)
- **Bundle Size**: 124KB optimized distribution
- **Processing Speed**: 102,229 operations/second (0.0098ms per iteration)
- **Memory Efficiency**: Object pooling with automatic cleanup
- **TypeScript Codebase**: 1,782 lines (40%+ reduction from v0.4.x)
- **API Surface**: 14 exports (65% reduction for better tree-shaking)
### Best Practices
1. **Reuse Registry instances** - avoid creating new ones frequently
2. **Use streams** for large data processing
3. **Disable confidence scoring** for maximum performance
4. **Pre-compile patterns** by loading plugins at startup
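
Two of the ideas mentioned in this README, pre-compiled patterns and prefiltering, can be shown in a few lines. The sketch below is a simplified illustration rather than the library's detector: `findEmails` and the deliberately loose email pattern are assumptions made for the example.

```typescript
// Compiled once at module load, then reused by every call
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g

function findEmails(text: string): string[] {
  // Prefilter: an email cannot exist without '@', so skip the regex entirely
  if (!text.includes('@')) return []
  return text.match(EMAIL_PATTERN) ?? []
}

console.log(findEmails('No contact details in this sentence'))
// []
console.log(findEmails('Contact: john@company.com, thanks'))
// [ 'john@company.com' ]
```

The cheap substring check costs far less than a regex scan, which matters when most input contains no PII at all.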
## Security Considerations
### HMAC Keys
- **Minimum 32 characters** required (enforced in v0.5.0)
- Store in environment variables, never in code
- Use different keys per environment
- Rotate keys regularly
- Tokens use the Base64URL format, so they are safe to embed in URLs and filenames
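
As a sketch of the first two points, the helper below reads the key from an environment-like object and fails fast when it is missing or too short. `requireHmacKey` and the variable name `NOREN_HMAC_KEY` are hypothetical and not defined by the library.

```typescript
// Hypothetical startup check; the environment variable name is an assumption.
const MIN_HMAC_KEY_LENGTH = 32

function requireHmacKey(
  env: Record<string, string | undefined>,
  name = 'NOREN_HMAC_KEY'
): string {
  const key = env[name]
  if (!key || key.length < MIN_HMAC_KEY_LENGTH) {
    throw new Error(`${name} must be set and at least ${MIN_HMAC_KEY_LENGTH} characters long`)
  }
  return key
}

// In Node.js this would typically be called as requireHmacKey(process.env)
const exampleKey = requireHmacKey({ NOREN_HMAC_KEY: 'x'.repeat(40) })
console.log(exampleKey.length) // 40
```

The returned value can then be passed as `hmacKey` when constructing a `Registry`, keeping the secret out of source control.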
### Memory Safety
- Automatic object pooling reduces GC pressure
- Sensitive data is cleared from memory after processing
- Configurable limits prevent DoS attacks
## Development Tools
For advanced features like benchmarking and A/B testing:
```bash
npm install @himorishige/noren-devtools
```
See [@himorishige/noren-devtools](../noren-devtools) for development and testing tools.
## Version History
### v0.6.0 (Latest) - Advanced Validation & Architecture Optimization
**Breaking Changes:**
- **Network detection separation**: IPv4/IPv6/MAC detection moved to `@himorishige/noren-plugin-network`
- **Smaller core bundle**: 35% reduction in core package size by removing network patterns
- **Plugin-based architecture**: Better modularity and optional feature loading
**New Features:**
- **Advanced validation system**: Context-aware false positive filtering with 3 strictness levels (`fast`/`balanced`/`strict`)
- **Plugin validation integration**: Automatic validation for plugin-detected PII types with seamless inheritance
- **Enhanced Japanese language support**: Specialized validators and expanded context keywords for improved accuracy
- **Debug utilities**: New `debugValidation()` function for detailed validation analysis
- **Performance optimized**: Validation adds minimal overhead while significantly reducing false positives
- **Context-aware filtering**: Smart detection of test data, examples, and weak contexts
- **Backward compatible**: All existing APIs work without changes (except network detection)
**Migration Guide:**
```typescript
// Before v0.6.0 (network detection included)
const result = await redactText(registry, 'IP: 192.168.1.1')

// v0.6.0+ (install the network plugin first):
//   npm install @himorishige/noren-plugin-network
import * as networkPlugin from '@himorishige/noren-plugin-network'
registry.use(networkPlugin.detectors, networkPlugin.maskers)
const resultWithPlugin = await redactText(registry, 'IP: 192.168.1.1')
```
### v0.5.0 - Performance & Structured Data Support
- **JSON/NDJSON detection**: Native support for structured data with key-based matching
- **Prefilter optimization**: Fast screening reduces processing time for non-PII text
- **77% code reduction**: Streamlined from 8,153 to 1,782 lines
- **Single-pass detection**: Unified pattern matching for better performance
- **Optimized IPv6 parser**: 31% size reduction with enhanced validation
- **Streamlined Hit Pool**: 47% size reduction with object pooling
- **Reduced API surface**: 65% fewer exports for better tree-shaking
- **Enhanced security**: Stricter boundaries and improved validation
- **Code quality improvements**: Full TypeScript strict mode compliance
### v0.4.0 - Confidence Scoring & Advanced Features
- Added confidence scoring system
- Environment-aware processing
- Enhanced HMAC security with 32-character minimum
- Development tools package separation
## License
MIT License - see [LICENSE](../../LICENSE) for details.
---
**Part of the [Noren](../../README.md) PII protection suite**