UNPKG

hana-cli

Version:
459 lines (324 loc) 12.9 kB
# dataProfile > Command: `dataProfile` > Category: **Analysis Tools** > Status: Early Access ## Description Generates comprehensive data quality metrics and statistical analysis for database tables. It analyzes NULL values, data distribution, uniqueness, and string patterns—useful for data quality assessment, understanding data characteristics, and identifying potential data issues. ### What is Data Profiling? **Data profiling** is the process of examining actual data in your database to understand its characteristics, quality, and structure. It goes beyond table schema to analyze: - **Data Distribution**: How data values are spread across the column - **Data Quality**: Presence of NULL values, duplicates, invalid formats - **Statistical Patterns**: Min/max values, averages, standard deviations - **Uniqueness**: How many distinct values exist (cardinality) - **Format Patterns**: String lengths, patterns, and data type compliance Think of it as an X-ray of your data that reveals what's actually there, not just what the schema says should be there. ### Why is Data Profiling Important? Data profiling provides critical insights that go beyond basic table structure: **Data Quality Assessment:** - **Hidden Issues**: Discover NULL values, invalid formats, or unexpected data types that violate assumptions - **Data Validity**: Identify strings where numbers should be, or negative values where only positives are expected - **Format Violations**: Find phone numbers, dates, or codes that don't match the expected format - **Range Violations**: Discover values outside acceptable ranges (age > 150, prices < 0) - **Completeness**: Understand what percentage of data is missing (NULL) **Understanding Your Data:** - **Cardinality Analysis**: Know how many unique values exist (is this really a "unique ID"?) - **Distribution Patterns**: Understand if data is evenly distributed or has hot spots - **Real-World Constraints**: Discover actual business rules vs. documented rules - **Data Type Reality**: Confirm that columns actually contain the data type they claim - **Time-Based Patterns**: Identify trends over time in data values **Business Impact:** - **Correct Reporting**: Base decisions on accurate understanding of data - **Risk Identification**: Uncover data quality issues before they cause problems - **Better Queries**: Write more efficient queries once you understand data distribution - **Integration Planning**: Know data quality requirements for system integrations - **Regulatory Compliance**: Document and validate data governance requirements - **Root Cause Analysis**: Understand why reports differ from expectations **Operational Benefits:** - **Migration Planning**: Understand data complexity before migrating to new systems - **Index Strategy**: Know which columns benefit most from indexing based on cardinality - **Performance Tuning**: Identify columns with skewed distributions (partitioning candidates) - **Data Cleanup**: Prioritize data quality improvements based on impact areas - **Documentation**: Create accurate data dictionaries with real statistics ### Use Cases for Data Profiling #### 1. New Project Kickoff ```bash # Profile all tables in a new schema hana-cli dataProfile --schema LEGACY_SYSTEM --cardinalityAnalysis --statisticalAnalysis ``` Understand the data you've inherited or integrated. #### 2. Data Quality Initiative ```bash # Identify columns with data quality issues hana-cli dataProfile --table CUSTOMERS \ --nullAnalysis \ --patternAnalysis \ --format json \ --output customer-quality-report.json ``` Find specific columns that need data cleansing. #### 3. Performance Tuning ```bash # Analyze cardinality for indexing decisions hana-cli dataProfile --table ORDERS \ --cardinalityAnalysis \ --columns ORDER_ID,CUSTOMER_ID,STATUS,ORDER_DATE ``` Identify which columns have high selectivity for indexes. #### 4. Data Governance Documentation ```bash # Generate statistical profile for data dictionary hana-cli dataProfile --table PRODUCTS \ --statisticalAnalysis \ --format csv \ --output products-statistics.csv ``` Create accurate documentation of real data characteristics. #### 5. Pre-Migration Validation ```bash # Compare profiles before and after migration hana-cli dataProfile --schema PRODUCTION --sampleSize 1000000 ``` Verify data integrity after system migration. #### 6. Report Accuracy Verification ```bash # Debug why summary report shows 0 customers hana-cli dataProfile --table CUSTOMERS --nullAnalysis --cardinalityAnalysis ``` Verify that key columns contain expected data. ## Syntax ```bash hana-cli dataProfile ``` ## Aliases - `prof` - `profileData` - `dataStats` ## Command Diagram ```mermaid graph TD Start([hana-cli dataProfile]) --> Input{Table Provided?} Input -->|No| Err[Exit with Error<br/>Table not found] Input -->|Yes| Connect[Connect to Database] Connect --> Schema{Schema uses CURRENT_SCHEMA token?} Schema -->|Yes| Resolve[Resolve Current Schema] Schema -->|No| UseSchema[Use Provided Schema] Resolve --> Columns[Load Table Columns] UseSchema --> Columns Columns --> Filter{Columns option specified?} Filter -->|Yes| Selected[Filter to Selected Columns] Filter -->|No| AllCols[Profile All Columns] Selected --> Analyze[Run Analyses<br/>NULL/Cardinality/Stats/Pattern] AllCols --> Analyze Analyze --> Output{Output Target} Output -->|Console| Print[Display summary/csv/json] Output -->|File path| Save[Write report file] Print --> Complete([Command Complete]) Save --> Complete style Start fill:#0092d1 style Complete fill:#2ecc71 style Input fill:#f39c12 style Schema fill:#f39c12 style Filter fill:#f39c12 style Output fill:#f39c12 style Err fill:#9b59b6 ``` ## Parameters ### Positional Arguments This command has no positional arguments. ### Options | Option | Alias | Type | Default | Description | | --- | --- | --- | --- | --- | | `--table` | `-t` | string | - | Target Table Name. Required for successful execution. | | `--schema` | `-s` | string | `**CURRENT_SCHEMA**` | Target Schema Name. | | `--columns` | `-c` | string | - | Columns to Profile (comma-separated, optional). | | `--output` | `-o` | string | - | Output Report File Path. If omitted, output is printed to console. | | `--format` | `-f` | string | `summary` | Report Format. Choices: `json`, `csv`, `summary`. | | `--nullAnalysis` | `--na` | boolean | `true` | Include NULL Value Analysis. | | `--cardinalityAnalysis` | `--ca` | boolean | `true` | Include Cardinality Analysis. | | `--statisticalAnalysis` | `--sa` | boolean | `true` | Include Statistical Analysis (min, max, avg). | | `--patternAnalysis` | `--pa` | boolean | `false` | Include Pattern Analysis (length statistics). | | `--sampleSize` | `--ss` | number | `10000` | Maximum Rows to Analyze. | | `--timeout` | `--to` | number | `3600` | Operation Timeout in Seconds. | | `--profile` | `-p` | string | - | CDS Profile. | | `--help` | `-h` | boolean | - | Show help. | ### Connection Parameters | Option | Alias | Type | Default | Description | | --- | --- | --- | --- | --- | | `--admin` | `-a` | boolean | `false` | Connect via admin (default-env-admin.json). | | `--conn` | - | string | - | Connection Filename to override default-env.json. | ### Troubleshooting | Option | Alias | Type | Default | Description | | --- | --- | --- | --- | --- | | `--disableVerbose` | `--quiet` | boolean | `false` | Disable Verbose output - removes extra output primarily intended for human-readable usage. | | `--debug` | `-d` | boolean | `false` | Debug hana-cli itself by adding lots of intermediate details. | For a complete list of parameters and options, use: ```bash hana-cli dataProfile --help ``` ## Output Metrics ### NULL Analysis - NULL count per column - NULL percentage - Completeness ratio (1 - NULL%) ### Cardinality Analysis - Distinct value count - Cardinality ratio (distinct / total) - Data uniqueness assessment ### Statistical Analysis - Minimum value - Maximum value - Average value - Applicable to numeric columns ### Pattern Analysis - Minimum string length - Maximum string length - Average string length - Useful for identifying data format issues ### Value Distribution - Top 5 most frequent values - Value frequency counts ## Output Formats ### Summary Format (Default) Console-friendly overview showing key metrics per column. ### JSON Format Structured format with complete metrics and metadata: ```json { "table": "EMPLOYEES", "rowCount": 10000, "columnCount": 15, "columns": { "SALARY": { "nullCount": 5, "distinctCount": 8942, "minValue": 25000, "maxValue": 250000, "avgValue": 85432.50, "topValues": [...] } } } ``` ### CSV Format Tabular format for analysis and comparison. ## Special Default Values | Token | Resolves To | Description | | --- | --- | --- | | `**CURRENT_SCHEMA**` | Current user schema | Default for `--schema` when not explicitly provided. | ## Interactive Mode This command can be used in interactive mode: ```bash hana-cli interactive --category data-tools ``` When choosing `dataProfile` interactively, `table` is prompted as required and `schema` is optional. Most advanced options are available but skipped by default unless explicitly provided. ## Examples ### 1. Basic Table Profile Quick overview of table data: ```bash hana-cli dataProfile -t EMPLOYEES ``` ### 2. Profile Specific Columns Analyze selected columns only: ```bash hana-cli dataProfile -t CUSTOMERS -c NAME,EMAIL,PHONE,ADDRESS ``` ### 3. Detailed JSON Report Export comprehensive profile: ```bash hana-cli dataProfile \ -t SALES_DATA \ -f json \ -o ./analysis/sales_profile.json ``` ### 4. Include Pattern Analysis Analyze string properties: ```bash hana-cli dataProfile \ -t PRODUCTS \ --patternAnalysis true \ -o ./analysis/product_patterns.json ``` ### 5. Schema-Specific Profile Profile table in non-default schema: ```bash hana-cli dataProfile -t CUSTOMERS -s BUSINESS_UNIT_1 ``` ### 6. CSV Format for Spreadsheet Export profile in CSV format: ```bash hana-cli dataProfile \ -t DATA \ -f csv \ -o ./analysis/data_profile.csv ``` ## Use Cases ### Data Quality Assessment Check overall quality of imported data: ```bash hana-cli dataProfile \ -t IMPORTED_DATA \ -f json \ -o ./quality_assessment.json ``` ### Data Discovery Understand new dataset characteristics: ```bash hana-cli dataProfile \ -t NEW_CUSTOMER_DATA \ -c CUSTOMER_ID,NAME,EMAIL,PHONE,ZIP_CODE ``` ### Identify Data Issues Find columns that might have problems: ```bash hana-cli dataProfile \ -t PRODUCTS \ --patternAnalysis true ``` ### Performance Analysis Understand data volume before optimization: ```bash hana-cli dataProfile -t LARGE_TABLE --sampleSize 50000 ``` ### Data Governance Document data characteristics for compliance: ```bash hana-cli dataProfile \ -t PROTECTED_DATA \ -f json \ -o ./governance/data_profile.json ``` ## Performance Considerations - **Sample size**: Reduce `--sampleSize` for faster profiling of huge tables - **Pattern analysis**: Only use `--patternAnalysis` when needed (string analysis is slower) - **Column filtering**: Profile only necessary columns - **Timeout**: Increase for large datasets with complex analysis ## Quality Assessment Matrix Use profile results to assess data quality: | Metric | Good | Warning | Bad | | --- | --- | --- | --- | | NULL % | < 5% | 5-20% | > 20% | | Cardinality | Near 100% | 50-100% | < 50% | | Min/Max Range | Reasonable | Extreme range | Outliers present | | Top values | Well distributed | Few dominant values | Single value dominates | ## Data Quality Red Flags - High NULL percentage (> 20%) - Unexpected NULL values in key columns - Extremely low cardinality (few distinct values) - Min/max values with obvious outliers - Single value dominating distribution - Empty strings mixed with NULLs ## Tips and Best Practices 1. **Profile before importing**: Understand source data quality 2. **Regular profiling**: Schedule periodic profiles for monitoring 3. **Compare profiles**: Run profiles on backup to identify changes 4. **Document findings**: Keep profile reports for reference 5. **Focus on key columns**: Profile business-critical columns first 6. **Use patterns**: Identify string format issues (zip codes, emails) 7. **Set appropriate samples**: Balance accuracy with performance ## Related Commands - **`dataValidator`** - Validate table data against business rules and constraints - **`duplicateDetection`** - Detect duplicate records using configurable matching strategies See the [Commands Reference](../all-commands.md) for other commands in this category. ## See Also - [Category: Analysis Tools](..) - [All Commands A-Z](../all-commands.md)