# duplicateDetection

> Command: `duplicateDetection`
> Category: **Analysis Tools**
> Status: Production Ready

## Description

Finds duplicate records in HANA tables using one of three matching strategies: exact matching, fuzzy matching with a similarity threshold, and partial key matching to identify near-duplicates.

### What Are Duplicate Records?

**Duplicate records** are multiple entries in a table that represent the same real-world entity but were entered separately. Common examples:

- Same customer entered twice with slightly different names (John Smith vs. Jon Smith)
- Same product created multiple times due to system errors
- Duplicate transactions from failed batch retries
- Data imported twice due to incomplete cleanup

### Why Is Duplicate Detection Critical?

Duplicate data creates significant problems across your organization:

**Data Quality Issues:**

- **False Uniqueness**: Records that should be unique (customers, products) appear multiple times
- **Skewed Metrics**: Counts, aggregations, and statistics become inaccurate
- **Broken Relationships**: Foreign key references may point to the wrong copy of a duplicated record
- **Data Inconsistency**: Updates to one copy are not reflected in the other copies

**Business Impact:**

- **Incorrect Revenue**: Duplicate customer records inflate customer counts and revenue figures
- **Invalid Analytics**: Reports show wrong trends, patterns, and insights
- **Marketing Waste**: Campaigns contact the same customer more than once
- **Compliance Risk**: Regulations (GDPR, CCPA) require accurate, non-redundant personal data
- **Loss of Trust**: Duplicate billing or communications damage customer relationships
- **Decision Errors**: Leadership makes decisions based on inflated or inaccurate data

**Operational & Financial Impact:**

- **Processing Waste**: Systems process duplicate records unnecessarily (storage, memory, CPU)
- **Storage Growth**: The database grows unnecessarily with redundant data
- **Manual Cleanup Costs**: Duplicates require time-consuming manual review and merging
- **Integration Failures**: Downstream systems may reject duplicated records or copy them further
- **Customer Support Issues**: Customers report receiving duplicate communications or bills
- **System Performance**: More records mean slower queries and reports

**Common Real-World Scenarios:**

1. **E-commerce**: Customer "John Smith" entered as "Jon Smith" and "John Smyth" → duplicate orders and shipping
2. **Healthcare**: Patient registered twice under slightly different spellings → medication overdose risk
3. **CRM**: Company "ABC Corp" and "ABC Corporation" tracked as different accounts → lost sales tracking
4. **Finance**: Same invoice processed twice → double-counting revenue
5. **Manufacturing**: Part number "A001" and "A-001" treated as different items → inventory mismatch

### How Duplicate Detection Helps

#### 1. Data Quality Assurance

```bash
# Identify duplicate customers by key columns
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns CUSTOMER_EMAIL \
  --mode exact
```

Find exact duplicates so you can decide which record to keep.

#### 2. Fuzzy Matching for Near-Duplicates

```bash
# Find similar customer names (typos, variations)
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns FIRST_NAME,LAST_NAME \
  --mode fuzzy \
  --threshold 0.85
```

Discover records that are similar but not identical (85% match threshold).

#### 3. Post-Migration Validation

```bash
# Ensure migration didn't create duplicates
# (--keyColumns is listed as required in the Options table)
hana-cli duplicateDetection \
  --table PRODUCTS \
  --keyColumns PRODUCT_SKU,PRODUCT_NAME \
  --limit 100000
```

Verify data integrity after system migration or import.

#### 4. Merge Strategy Planning

```bash
# Generate detailed duplicate report for analysis
hana-cli duplicateDetection \
  --table SUPPLIERS \
  --keyColumns SUPPLIER_NAME,COUNTRY \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates-analysis.json
```

Export duplicates for review and decision-making before merging.

#### 5. Ongoing Monitoring

```bash
# Regular duplicate checks as part of data governance
# (--keyColumns is listed as required in the Options table)
hana-cli duplicateDetection \
  --table vendor_contracts \
  --keyColumns vendor_id,contract_number \
  --mode exact \
  --output daily-dups.csv
```

Monitor for new duplicates introduced by ongoing operations.

## Syntax

```bash
hana-cli duplicateDetection [options]
```

## Aliases

- `dupdetect`
- `findDuplicates`
- `duplicates`

## Command Diagram

```mermaid
graph TD
    A["🔷 hana-cli duplicateDetection"]
    A --> B["📋 Required Parameters"]
    B --> B1["-t, --table: Table to check"]
    B1 --> B2["-k, --keyColumns: Key columns"]
    B2 --> C["📍 Schema & Connection"]
    C --> C1["-s, --schema: Schema for table"]
    C1 --> C2["-a, --admin: Connect via admin"]
    C2 --> C3["--conn: Connection file override"]
    C3 --> D["📊 Column Selection"]
    D --> D1["-c, --checkColumns: Columns to check"]
    D1 --> D2["-e, --excludeColumns: Exclude columns"]
    D2 --> E["🔬 Detection Options"]
    E --> E1["-m, --mode: exact/fuzzy/partial"]
    E1 --> E2["--threshold, --th: Match threshold"]
    E2 --> E3["-l, --limit: Max rows to check"]
    E3 --> E4["--timeout, --to: Timeout"]
    E4 --> F["🔢 Output & Format"]
    F --> F1["-o, --output: Report file"]
    F1 --> F2["-f, --format: Report format"]
    F2 --> F3["-p, --profile: CDS Profile"]
    F3 --> G["🔍 Troubleshooting"]
    G --> G1["--disableVerbose, --quiet"]
    G1 --> G2["-d, --debug: Debug mode"]
    G2 --> H["✅ Help: -h, --help"]
    style A fill:#0070C0,color:#fff,stroke:#fff,stroke-width:2px
    style H fill:#51CF66,color:#fff,stroke:#fff,stroke-width:2px
```

## Parameters

### Positional Arguments

This command has no positional arguments.
### Options

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--table` | `-t` | string | required | Name of the table to check |
| `--schema` | `-s` | string | `**CURRENT_SCHEMA**` | Schema for table |
| `--keyColumns` | `-k` | string | required | Comma-separated key columns for matching |
| `--checkColumns` | `-c` | string | - | Columns to Check for Duplicates (comma-separated, optional) |
| `--excludeColumns` | `-e` | string | - | Columns to Exclude from Check (comma-separated, optional) |
| `--mode` | `-m` | string | `exact` | Detection Mode. Choices: `exact`, `fuzzy`, `partial` |
| `--threshold` | `--th` | number | `0.95` | Similarity Threshold for Fuzzy Matching (0-1) |
| `--output` | `-o` | string | - | Output Report File Path |
| `--format` | `-f` | string | `summary` | Report Format. Choices: `json`, `csv`, `summary` |
| `--limit` | `-l` | number | `10000` | Maximum Rows to Check |
| `--timeout` | `--to` | number | `3600` | Operation Timeout in Seconds |
| `--profile` | `-p` | string | - | CDS Profile |
| `--help` | `-h` | boolean | - | Show help |

### Connection Parameters

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--admin` | `-a` | boolean | `false` | Connect via admin (default-env-admin.json) |
| `--conn` | - | string | - | Connection Filename to override default-env.json |

### Troubleshooting

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--disableVerbose` | `--quiet` | boolean | `false` | Disable verbose output - removes extra output mainly intended for human-readable usage |
| `--debug` | `-d` | boolean | `false` | Debug hana-cli itself by adding lots of intermediate details |

For a complete list of parameters and options, use:

```bash
hana-cli duplicateDetection --help
```

## Detection Modes

- **exact** (default) - Find identical values in key columns
- **fuzzy** - Find similar values using Levenshtein distance and similarity threshold
- **partial** - Find duplicates using only first key column

## Similarity Threshold

The threshold determines what counts as a match in fuzzy mode:

- `1.0` (100%) - Exact match only
- `0.95` (95%) - Allow 1-2 character differences per field
- `0.90` (90%) - Allow 3-4 character differences per field
- `0.85` (85%) - More lenient matching

## Output Examples

### Summary (default)

```text
Duplicate Detection Report
==========================
Total Rows: 10000
Unique Rows: 9850
Duplicate Groups: 75
Total Duplicates: 150

Duplicate Groups:
  Group: John||Smith, Records: 2, Match: 100%
  Group: John||Smyth, Records: 3, Match: 95%
  Group: Jane||Doe, Records: 2, Match: 100%
  ...
```

### JSON

```json
{
  "totalRows": 10000,
  "uniqueRows": 9850,
  "duplicateGroups": 75,
  "totalDuplicates": 150,
  "duplicates": [
    {
      "matchKey": "John||Smith",
      "matchPercentage": 100,
      "count": 2,
      "records": [
        {
          "rowNumber": 5,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        },
        {
          "rowNumber": 1250,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        }
      ]
    }
  ]
}
```

### CSV

```csv
Group,Rows,Similarity
"John||Smith",2,100%
"John||Smyth",3,95%
"Jane||Doe",2,100%
```

## Understanding Results

### Exact Matches

All values in key columns are identical. These are definite duplicates.

### Fuzzy Matches

Values are similar but not identical. The similarity percentage indicates how close they are.

### Partial Matches

Duplicates identified based on a subset of key columns.
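To make the similarity percentages above more concrete, here is a minimal sketch of one common way to turn a Levenshtein distance into a 0-1 score. It assumes a length-normalized formula, `1 - distance / max(len(a), len(b))`. The documentation names Levenshtein distance but does not state how hana-cli converts it into a percentage, so both the formula and the `similarity` helper are illustrative assumptions rather than the tool's implementation.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of hana-cli): prints a normalized
# Levenshtein similarity for two strings, rounded to two decimals.
# ASSUMPTION: score = 1 - distance / max(len(a), len(b)).
# Lengths are counted as awk counts characters; ASCII input is assumed.
similarity() {
  SIM_A="$1" SIM_B="$2" awk 'BEGIN {
    a = ENVIRON["SIM_A"]; b = ENVIRON["SIM_B"]
    la = length(a); lb = length(b)
    if (la == 0 && lb == 0) { print "1.00"; exit }

    # prev[j] holds the edit distance between a[1..i-1] and b[1..j]
    for (j = 0; j <= lb; j++) prev[j] = j

    for (i = 1; i <= la; i++) {
      cur[0] = i
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        best = prev[j] + 1                                   # deletion
        if (cur[j-1] + 1 < best) best = cur[j-1] + 1         # insertion
        if (prev[j-1] + cost < best) best = prev[j-1] + cost # substitution
        cur[j] = best
      }
      for (j = 0; j <= lb; j++) prev[j] = cur[j]
    }

    max = (la > lb) ? la : lb
    printf "%.2f\n", 1 - prev[lb] / max
  }'
}

similarity "John" "Jon"      # prints 0.75
similarity "Smith" "Smyth"   # prints 0.80
similarity "A001" "A-001"    # prints 0.80
```

Under this assumed normalization, a single-character difference in a short value scores well below 0.95, so the number of differing characters a given threshold tolerates depends on field length. Trying candidate thresholds on a sample with `--limit` before a full run is a cheap way to check how a threshold behaves on your data.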
## Examples

### Exact duplicate detection

```bash
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode exact
```

### Fuzzy duplicate detection with threshold

```bash
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates.json
```

### Exclude specific columns

```bash
hana-cli duplicateDetection --table PRODUCTS \
  --keyColumns "SKU" \
  --excludeColumns "CREATED_DATE,MODIFIED_DATE" \
  --limit 50000
```

### Partial matching

```bash
hana-cli duplicateDetection --table SUPPLIERS \
  --keyColumns "COMPANY_NAME" \
  --mode partial
```

## Handling Duplicates

After identifying duplicates, you can:

1. **Report Only** - Generate report and review manually
2. **Tag Records** - Add a flag/status column to mark duplicates
3. **Merge Records** - Combine duplicate records into one
4. **Delete Duplicates** - Remove duplicate entries (keep first occurrence)
5. **Review Process** - Use data steward process to determine action

Example workflow:

```bash
# Step 1: Identify fuzzy duplicates
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME,EMAIL" \
  --mode fuzzy --threshold 0.92 \
  --format json --output duplicates.json

# Step 2: Review and manually validate
# (Review duplicates.json and create merge/delete list)

# Step 3: Execute cleanup
# (Use data governance process or scripts)
```

## Return Codes

- `0` - Detection completed successfully
- `1` - Detection error or database connection issue

## Performance Tips

1. Use `exact` mode for better performance on large tables
2. Use `--limit` to test on a subset first
3. Specify only the key columns needed to identify a record
4. Use `--excludeColumns` to skip irrelevant columns
5. Increase `--threshold` for faster fuzzy matching

## Related Commands

- `dataValidator` - Validate data against business rules
- `dataProfile` - Generate statistical profiles

See the [Commands Reference](../all-commands.md) for other commands in this category.

## See Also

- [Category: Analysis Tools](..)
- [All Commands A-Z](../all-commands.md)