octocode-data-masker

A TypeScript library for masking sensitive data in strings, including PII, tokens, API keys, and more

# sensitive-data-masker

A high-performance TypeScript library for detecting and masking sensitive data in strings. Protect PII, API keys, tokens, credentials, and other confidential information with intelligent masking algorithms and configurable accuracy levels.

[![npm version](https://img.shields.io/npm/v/sensitive-data-masker.svg)](https://www.npmjs.com/package/sensitive-data-masker)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.8+-blue)](https://www.typescriptlang.org/)
[![Node.js](https://img.shields.io/badge/Node.js-18.12+-green)](https://nodejs.org/)

## Features

- 🛡️ **200+ Detection Patterns**: Comprehensive coverage for modern security needs
- ⚡ **High Performance**: Optimized regex engine with pattern caching
- 🎯 **Accuracy Control**: Configure detection sensitivity (high/medium/low)
- 🔧 **Flexible Masking**: Smart partial masking that preserves readability
- 📦 **Zero Dependencies**: Lightweight and secure
- 🌍 **International Support**: Handles US, UK, Canadian, and international formats
- 🔍 **Pattern Filtering**: Include or exclude specific pattern types
- 📊 **Detailed Results**: Get match counts, positions, and masked values

## Installation

```bash
npm install sensitive-data-masker
```

```bash
yarn add sensitive-data-masker
```

## Quick Start

```typescript
import { mask, hasSensitiveContent, getPatternMatches } from 'sensitive-data-masker';

// Basic usage - intelligent partial masking
const text = 'My email is john@example.com and my SSN is 123-45-6789';
const result = mask(text);

console.log(result.output);
// "My email is **hn@example.c** and my SSN is **3-45-67**"

console.log(result.found);
// { email: 1, ssn: 1 }

// Check if content contains sensitive data
const isSensitive = hasSensitiveContent(text);
console.log(isSensitive); // true

// Get detailed pattern matches with positions
const matches = getPatternMatches(text);
console.log(matches);
// [
//   {
//     pattern: 'email',
//     matches: [{ match: 'john@example.com', startIndex: 12, endIndex: 27 }]
//   },
//   {
//     pattern: 'ssn',
//     matches: [{ match: '123-45-6789', startIndex: 43, endIndex: 53 }]
//   }
// ]
```

## API Reference

### `mask(input: string, options?: MaskingOptions): MaskResult`

Masks sensitive content in a string using intelligent partial masking.

#### Options

```typescript
interface MaskingOptions {
  maskChar?: string;            // Character used for masking (default: '*')
  preserveLength?: boolean;     // Preserve original length (default: false)
  excludePatterns?: string[];   // Patterns to exclude from masking
  onlyPatterns?: string[];      // Only mask these patterns
  matchAccuracy?: 'high' | 'medium' | 'low'; // Detection sensitivity
}
```

#### Returns

```typescript
interface MaskResult {
  output: string;                    // Masked string
  found: { [name: string]: number }; // Count of each pattern found
  matches: string[];                 // Original matched values
  masked: string[];                  // Masked versions of matches
}
```

### `hasSensitiveContent(input: string, options?): boolean`

Quickly check whether a string contains sensitive data without performing masking.

```typescript
import { hasSensitiveContent } from 'sensitive-data-masker';

hasSensitiveContent('user@example.com'); // true
hasSensitiveContent('hello world');      // false

// With options
hasSensitiveContent('sk-1234567890abcdef', {
  matchAccuracy: 'high',
  excludePatterns: ['genericId']
}); // true
```

### `getPatternMatches(input: string, options?): PatternMatch[]`

Get detailed information about all pattern matches, including their positions. In the examples in this README, `startIndex` and `endIndex` are zero-based and `endIndex` is inclusive.
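Match positions make it possible to apply a redaction policy of your own instead of the library's partial masking. The following is only a sketch under stated assumptions: it does not call the library, `PositionedMatch`, `PatternMatchLike`, and `redactByPosition` are hypothetical names, and the data shape (zero-based `startIndex`, inclusive `endIndex`) is inferred from the example output in this README rather than from the library's published types.

```typescript
// Hypothetical types modeled on the example output shown in this README;
// the library's real exported types may differ.
interface PositionedMatch {
  match: string;
  startIndex: number;
  endIndex: number; // assumed inclusive
}

interface PatternMatchLike {
  pattern: string;
  matches: PositionedMatch[];
}

// Fully redact every reported span, leaving the rest of the input untouched.
// maskChar is expected to be a single character so the length is preserved.
function redactByPosition(
  input: string,
  found: PatternMatchLike[],
  maskChar = '*'
): string {
  const chars = input.split('');
  for (const { matches } of found) {
    for (const { startIndex, endIndex } of matches) {
      // Clamp to the input bounds so a stale or malformed match cannot
      // write outside the string.
      const from = Math.max(0, startIndex);
      const to = Math.min(chars.length - 1, endIndex);
      for (let i = from; i <= to; i++) {
        chars[i] = maskChar;
      }
    }
  }
  return chars.join('');
}

// Hand-written sample data in the assumed shape (not produced by the library).
const sample = 'Contact: admin@test.com and key: sk-123abc';
const sampleMatches: PatternMatchLike[] = [
  { pattern: 'email', matches: [{ match: 'admin@test.com', startIndex: 9, endIndex: 22 }] },
  { pattern: 'openaiApiKey', matches: [{ match: 'sk-123abc', startIndex: 33, endIndex: 41 }] },
];

console.log(redactByPosition(sample, sampleMatches));
// "Contact: ************** and key: *********"
```

Writing into a character array rather than splicing substrings keeps overlapping spans harmless: a character covered by two matches is simply masked twice.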
```typescript
import { getPatternMatches } from 'sensitive-data-masker';

const matches = getPatternMatches('Contact: admin@test.com and key: sk-123abc');
console.log(matches);
// [
//   {
//     pattern: 'email',
//     matches: [{ match: 'admin@test.com', startIndex: 9, endIndex: 22 }]
//   },
//   {
//     pattern: 'openaiApiKey',
//     matches: [{ match: 'sk-123abc', startIndex: 33, endIndex: 41 }]
//   }
// ]
```

## Advanced Usage

### Custom Masking Options

```typescript
import { mask } from 'sensitive-data-masker';

// Custom masking character
const result = mask('API key: sk-1234567890abcdef', { maskChar: '#' });
console.log(result.output); // "API key: ##-1234567890ab##"

// Preserve original length
const result2 = mask('secret123', { preserveLength: true });
console.log(result2.output); // "*********" (full length masked)

// Use high accuracy mode (fewer false positives)
const result3 = mask('sk-1234567890abcdef', { matchAccuracy: 'high' });
console.log(result3.output); // "**-1234567890ab**" (default mask character)
```

### Pattern Filtering

```typescript
import { mask } from 'sensitive-data-masker';

// Only mask specific patterns
const result = mask('Email: user@test.com, API: sk-123', {
  onlyPatterns: ['email', 'openaiApiKey']
});

// Exclude certain patterns
const result2 = mask('Email: user@test.com, UUID: 123e4567-e89b-12d3-a456-426614174000', {
  excludePatterns: ['uuid', 'genericId']
});

// Combine with accuracy control
const result3 = mask(sensitiveText, {
  matchAccuracy: 'high',
  excludePatterns: ['uuid']
});
```

## Supported Pattern Categories

The library detects sensitive data across **25 categories** with **200+ patterns**:

### 🆔 Personally Identifiable Information (PII)

- Email addresses (multiple formats)
- Phone numbers (US, International, E.164)
- Social Security Numbers (US with various formats)
- Driver's license numbers, Medical record numbers
- Tax IDs (TIN/EIN), Canadian SIN, UK National Insurance Numbers

### ☁️ Cloud Provider Credentials

- **AWS**: Access keys, secret keys, session tokens, account IDs
- **AWS Resources**: EC2, S3, RDS, Lambda ARNs, VPC IDs
- **Azure**: Subscription IDs, client secrets, resource IDs
- **Google Cloud**: API keys, service account keys, project IDs

### 💳 Financial & Payment Services

- Credit card numbers (Visa, MasterCard, Amex, Discover)
- **Stripe**: Secret keys, publishable keys, webhook secrets
- **PayPal**: Access tokens, client IDs
- **Square**: Access tokens, application IDs
- Bank account numbers (US routing numbers, IBAN)

### 🤖 AI Provider Credentials

- **OpenAI**: API keys, organization IDs
- **Anthropic/Claude**: API keys
- **Google AI**: Gemini API keys, Vertex AI tokens
- **Hugging Face**: Access tokens, API keys
- **Other AI**: Groq, Perplexity, Replicate, Together AI

### 🔐 Authentication & Security

- JWT tokens, Bearer tokens
- OAuth access tokens, refresh tokens
- API keys in headers (`X-API-Key`, `Authorization`)
- Session IDs, CSRF tokens
- Generic secret patterns in environment variables

### 🔧 Developer Tools & Services

- **GitHub**: Personal access tokens, app tokens
- **Slack**: Bot tokens, webhook URLs, app secrets
- **Discord**: Bot tokens, webhook URLs
- **Analytics**: Google Analytics, Mixpanel, Amplitude
- **Monitoring**: Datadog, New Relic, Sentry keys

### 🗄️ Database & Storage

- Database connection strings (PostgreSQL, MySQL, MongoDB)
- **File Storage**: S3 bucket URLs, Azure Blob Storage
- **CDN**: CloudFront URLs, Azure CDN
- Redis connection strings, Elasticsearch URLs

### 🔑 Cryptographic Materials

- RSA private keys, SSH private keys
- EC private keys, DSA private keys
- X.509 certificates, PGP private key blocks
- JSON Web Keys (JWK), PKCS#8 keys

### 🌐 Network & Location

- IPv4/IPv6 addresses, MAC addresses
- Geographic coordinates (latitude/longitude)
- Private network ranges, subnet masks
- URL patterns with embedded secrets

### 📱 Communication Services

- **Messaging**: Twilio, SendGrid, Mailgun keys
- **Social Media**: Twitter, Facebook, Instagram tokens
- **Email Services**: Mailchimp, Postmark, SparkPost
- **SMS/Voice**: Nexmo, Plivo, MessageBird

### 🛠️ Infrastructure & DevOps

- **Container Registries**: Docker Hub, ECR, GCR tokens
- **CI/CD**: Jenkins, GitLab CI, CircleCI tokens
- **Deployment**: Vercel, Netlify, Heroku tokens
- **Monitoring**: PagerDuty, Datadog, New Relic

### 🏢 Enterprise & Business

- **CRM**: Salesforce, HubSpot tokens
- **E-commerce**: Shopify, WooCommerce keys
- **Business Tools**: Slack, Microsoft Teams tokens
- **Analytics**: Google Analytics, Adobe Analytics

### 🎯 Generic Patterns

- UUID v4, Generic IDs
- Base64 encoded secrets
- Hex-encoded keys (32, 64, 128 bit)
- Custom secret patterns in configuration files

### 🔍 URL & Reference Patterns

- URLs with embedded tokens
- Database connection URIs
- API endpoints with keys
- Webhook URLs with secrets

### 💾 Version Control & Code

- Git repository URLs with tokens
- Package manager tokens (npm, PyPI)
- Container registry credentials
- Code hosting platform tokens

## Pattern Accuracy Levels

Control detection sensitivity to balance security against false positives:

### High Accuracy

- Most specific patterns with minimal false positives
- Examples: AWS access keys with `AKIA` prefix, specific API key formats
- Best for production environments

### Medium Accuracy (Default)

- Balanced detection with reasonable false positive rates
- Examples: Generic API keys, common secret patterns
- Good for most use cases

### Low Accuracy

- Broadest detection, may have higher false positive rates
- Examples: Generic IDs, loose pattern matching
- Useful for comprehensive scanning

```typescript
// Use high accuracy for production
const prodResult = mask(text, { matchAccuracy: 'high' });

// Use medium accuracy for development
const devResult = mask(text, { matchAccuracy: 'medium' });

// Use low accuracy for comprehensive scanning
const scanResult = mask(text, { matchAccuracy: 'low' });
```

## TypeScript Support

Full TypeScript support with complete type definitions:

```typescript
import { mask, hasSensitiveContent, getPatternMatches } from 'sensitive-data-masker';
import type { MaskResult, MaskingOptions } from 'sensitive-data-masker';

// Type-safe masking options
const options: MaskingOptions = {
  maskChar: '#',
  matchAccuracy: 'high',
  excludePatterns: ['uuid']
};

const result: MaskResult = mask(text, options);
```

## Real-World Examples

### Log File Sanitization

```typescript
import { mask } from 'sensitive-data-masker';

const logEntry = `
[2024-01-15 10:30:45] INFO User john@company.com logged in
[2024-01-15 10:31:12] DEBUG API call with key sk-1234567890abcdef
[2024-01-15 10:31:15] ERROR Payment failed for card 4111-1111-1111-1111
[2024-01-15 10:31:20] WARN SSN in request: 123-45-6789
`;

const sanitized = mask(logEntry);
console.log(sanitized.output);
// [2024-01-15 10:30:45] INFO User **hn@company.c** logged in
// [2024-01-15 10:31:12] DEBUG API call with key **-1234567890ab**
// [2024-01-15 10:31:15] ERROR Payment failed for card **11-1111-1111-11**
// [2024-01-15 10:31:20] WARN SSN in request: **3-45-67**

console.log(sanitized.found);
// { email: 1, openaiApiKey: 1, creditCard: 1, ssn: 1 }
```

### Configuration File Security

```typescript
import { mask } from 'sensitive-data-masker';

const config = `
DATABASE_URL=postgresql://user:password123@localhost:5432/db
OPENAI_API_KEY=sk-1234567890abcdef1234567890abcdef
STRIPE_SECRET_KEY=sk_live_abcdef123456
ADMIN_EMAIL=admin@company.com
JWT_SECRET=super-secret-key-123
`;

const result = mask(config);
console.log(result.output);
// DATABASE_URL=postgresql://user:**ssword1**@localhost:5432/db
// OPENAI_API_KEY=**-1234567890abcdef1234567890ab**
// STRIPE_SECRET_KEY=**_live_abcdef12**
// ADMIN_EMAIL=**min@company.c**
// JWT_SECRET=**per-secret-key-1**
```

### Multi-Environment Setup

```typescript
import { mask } from 'sensitive-data-masker';

// Production: Mask everything with high accuracy
const prodResult = mask(sensitiveData, { matchAccuracy: 'high' });

// Development: Allow test emails but mask real API keys
const devResult = mask(sensitiveData, {
  matchAccuracy: 'medium',
  excludePatterns: ['email']
});

// Testing: Only mask financial data
const testResult = mask(sensitiveData, {
  onlyPatterns: ['creditCard', 'bankAccount', 'ssn'],
  matchAccuracy: 'high'
});
```

### Data Pipeline Processing

```typescript
import { hasSensitiveContent, mask } from 'sensitive-data-masker';

// Check if data needs processing
function processBatch(records: string[]) {
  const results = records.map(record => {
    if (hasSensitiveContent(record)) {
      const masked = mask(record, { matchAccuracy: 'high' });
      return {
        data: masked.output,
        hadSensitiveData: true,
        patternsFound: Object.keys(masked.found)
      };
    }
    return { data: record, hadSensitiveData: false };
  });
  return results;
}
```

## Performance Considerations

- **Optimized Regex Engine**: Patterns are compiled and cached on first use
- **Single-Pass Processing**: Efficient string traversal with minimal overhead
- **Memory Efficient**: No unnecessary string copies or allocations
- **Pattern Filtering**: Use `onlyPatterns` when you know which types to look for
- **Accuracy Optimization**: Higher accuracy modes are faster due to more specific patterns

```typescript
// Optimize for specific use cases
const emailsOnly = mask(text, { onlyPatterns: ['email'] });     // Faster
const highAccuracy = mask(text, { matchAccuracy: 'high' });     // Faster, fewer false positives
const comprehensive = mask(text, { matchAccuracy: 'low' });     // Slower, more thorough
```

## Security Best Practices

1. **Always mask before logging**: Ensure sensitive data is masked before writing to logs
2. **Use appropriate accuracy**: Higher accuracy for production, lower for development/testing
3. **Store results securely**: The `matches` array contains original sensitive values
4. **Regular updates**: Keep the library updated for new pattern definitions
5. **Test your patterns**: Verify masking works correctly with your specific data formats
6. **Environment-specific config**: Use different settings for dev/staging/production

## Development

### Prerequisites

- Node.js >= 18.12.0
- Yarn or npm

### Setup

```bash
git clone https://github.com/bgauryy/sensitive-data-mask.git
cd sensitive-data-mask
yarn install
```

### Commands

```bash
yarn build      # Build the library
yarn dev        # Build in watch mode
yarn lint       # Run ESLint
yarn test       # Run tests
yarn typecheck  # Run TypeScript compiler checks
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Adding New Patterns

1. Choose the appropriate category file in `src/regexes/`
2. Add your pattern following the existing structure:

   ```typescript
   {
     name: 'myPattern',
     regex: /your-regex-here/gi,
     description: 'Description of what this detects',
     matchAccuracy: 'medium' // optional: 'high', 'medium', or 'low'
   }
   ```

3. Run tests to ensure no regressions
4. Submit a PR with a clear description

## License

MIT © [guybary](https://github.com/bgauryy)

## Security

If you discover a security vulnerability, please email guybary@wix.com instead of using the issue tracker.

---

**Made with ❤️ for developers who care about data security**