agentic-data-stack-community
AI Agentic Data Stack Framework - Community Edition. Open source data engineering framework with 4 core agents, essential templates, and 3-dimensional quality validation.
# Task: Profile Data
## Overview
Conducts comprehensive data profiling to understand data characteristics, quality patterns, and structural properties. Provides foundational insights for quality rule development, data governance, and business intelligence initiatives through automated and manual profiling techniques.
## Prerequisites
- Access to data sources and target datasets
- Data sampling strategies and access permissions
- Profiling tool selection and configuration
- Business context and domain knowledge
- Data privacy and security requirements for profiling activities
## Dependencies
- Templates: `data-profiling-tmpl.yaml`, `profiling-report-tmpl.yaml`
- Tasks: `create-data-contract.md`, `gather-requirements.md`
- Checklists: `data-profiling-checklist.md`
## Steps
### 1. **Profiling Strategy and Planning**
- Define profiling objectives and scope based on business requirements
- Select appropriate data sources and datasets for profiling
- Design sampling strategies to ensure representative profiling results
- Plan profiling execution approach and resource allocation
- **Validation**: Profiling strategy approved by stakeholders and aligns with objectives
### 2. **Structural Data Profiling**
- Analyze data schema, field definitions, and data types
- Identify table structures, relationships, and dependencies
- Document data lineage and source system characteristics
- Assess data volume, growth patterns, and storage requirements
- **Quality Check**: Structural profiling documents the schema, relationships, lineage, and volume of every in-scope dataset
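The schema analysis in this step can be sketched as follows. This is a minimal illustration, assuming rows arrive as dictionaries of raw strings (for example from `csv.DictReader`); the function names `infer_type` and `profile_schema` and the sample data are hypothetical and not part of the framework.

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string value into a coarse type label."""
    if value is None or value == "":
        return "null"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            continue
    return "string"

def profile_schema(rows):
    """Return {column: Counter of inferred type labels} for a list of dict rows."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, Counter())[infer_type(value)] += 1
    return schema

# Hypothetical sample rows
rows = [
    {"order_id": "1001", "amount": "19.99", "country": "DE"},
    {"order_id": "1002", "amount": "5", "country": ""},
    {"order_id": "A-17", "amount": "7.50", "country": "FR"},
]
schema = profile_schema(rows)
for column, types in schema.items():
    print(column, dict(types))
```

A column that reports more than one non-null type, such as `order_id` here, is a candidate for closer review against the declared schema.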
### 3. **Content Data Profiling**
- Analyze data value distributions, patterns, and ranges
- Identify unique values, cardinality, and value frequencies
- Detect null values, missing data patterns, and completeness rates
- Assess data formats, encoding, and standardization levels
- **Validation**: Content profiling reveals data characteristics and quality patterns
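As a rough sketch of the content metrics listed above (completeness, cardinality, value frequencies), a single column could be summarized like this. It assumes that `None` and the empty string both count as missing; `profile_column` and the sample values are illustrative names only.

```python
from collections import Counter

def profile_column(values, top_n=3):
    """Summarize completeness, cardinality, and top values for one column."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    frequencies = Counter(present)
    return {
        "count": total,
        "null_rate": (total - len(present)) / total if total else 0.0,
        "distinct": len(frequencies),
        "top_values": frequencies.most_common(top_n),
    }

# Hypothetical column values
countries = ["DE", "FR", "DE", "", None, "US", "DE", "FR"]
profile = profile_column(countries)
print(profile)
```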
### 4. **Quality Pattern Analysis**
- Identify data quality issues and anomaly patterns
- Analyze data consistency across systems and time periods
- Detect duplicate records and data redundancy patterns
- Assess data accuracy through format and business rule validation
- **Quality Check**: Quality analysis identifies specific improvement opportunities
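Two of the checks above, duplicate detection and format validation, might look like the sketch below. The email pattern is a deliberately loose assumption rather than a full address validator, and the helper names and sample records are hypothetical.

```python
import re
from collections import Counter

# Deliberately loose: something@something.something, no whitespace
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def find_duplicates(rows, key_fields):
    """Return key tuples that occur more than once, with their counts."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def invalid_format(rows, field, pattern):
    """Return values of `field` that do not match `pattern`."""
    return [row[field] for row in rows if not pattern.match(row[field])]

# Hypothetical customer records
customers = [
    {"id": "1", "email": "ana@example.com"},
    {"id": "2", "email": "bob[at]example.com"},
    {"id": "1", "email": "ana@example.com"},
]
duplicates = find_duplicates(customers, ["id", "email"])
bad_emails = invalid_format(customers, "email", EMAIL_PATTERN)
print(duplicates)
print(bad_emails)
```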
### 5. **Business Rule Discovery**
- Discover implicit business rules and data relationships
- Identify data constraints and validation requirements
- Analyze business entity relationships and hierarchies
- Document domain-specific patterns and business logic
- **Validation**: Business rule discovery validated with domain experts
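One way to surface implicit rules is to test which columns appear to determine others (candidate functional dependencies). The sketch below is an assumption about how such a check could be written, not the framework's own method; it uses brute force over column pairs and is only practical for small samples.

```python
from itertools import permutations

def holds_dependency(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if seen.setdefault(key, value) != value:
            return False
    return True

def discover_dependencies(rows):
    """List (determinant, dependent) column pairs that hold in the given rows."""
    columns = list(rows[0].keys())
    return [
        (a, b) for a, b in permutations(columns, 2)
        if holds_dependency(rows, a, b)
    ]

# Hypothetical address records
rows = [
    {"zip": "10115", "city": "Berlin", "customer": "c1"},
    {"zip": "10115", "city": "Berlin", "customer": "c2"},
    {"zip": "75001", "city": "Paris", "customer": "c3"},
]
candidates = discover_dependencies(rows)
print(candidates)
```

A unique column such as `customer` trivially determines every other column, and small samples can suggest rules that do not hold in general, which is why the candidates still need review by domain experts.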
### 6. **Statistical and Trend Analysis**
- Perform statistical analysis of data distributions and correlations
- Identify trends, seasonality, and temporal patterns in data
- Analyze data growth rates and usage patterns
- Assess data stability and volatility characteristics
- **Quality Check**: Statistical analysis provides insights for forecasting and planning
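Growth and volatility can be estimated from a history of simple volume measurements. The following sketch assumes a list of positive monthly row counts; the figures are hypothetical, and the coefficient of variation is used here as one possible volatility measure among several.

```python
import statistics

def growth_rates(series):
    """Period-over-period relative change for a sequence of positive counts."""
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

def volatility(series):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return statistics.stdev(series) / statistics.mean(series)

# Hypothetical table sizes at the end of four consecutive months
monthly_row_counts = [1000, 1100, 1210, 1331]
rates = growth_rates(monthly_row_counts)
cv = volatility(monthly_row_counts)
print([round(r, 3) for r in rates])
print(round(cv, 3))
```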
### 7. **Profiling Report Generation and Validation**
- Generate comprehensive profiling reports with findings and recommendations
- Create executive summaries and technical detail reports
- Validate profiling results with business stakeholders and domain experts
- Document profiling methodology and assumptions
- **Final Validation**: Profiling results validated and accepted by stakeholders
## Interactive Features
### Dynamic Profiling Dashboard
- **Real-time profiling** with live data analysis and visualization
- **Interactive exploration** with drill-down capabilities and filtering
- **Comparative analysis** between different datasets and time periods
- **Anomaly detection** with automated highlighting of unusual patterns
### Automated Profiling Workflows
- **Scheduled profiling** with automated execution and report generation
- **Incremental profiling** tracking changes and evolution over time
- **Alert generation** for significant changes in data characteristics
- **Profile comparison** showing differences between profiling runs
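Profile comparison and alert generation can be approximated by diffing the metrics of two runs. This is a minimal sketch, assuming each run is stored as a flat dictionary of numeric metrics; the metric names and the 0.05 tolerance are illustrative choices, not framework defaults.

```python
def compare_profiles(previous, current, tolerance=0.05):
    """Return metrics whose absolute change between runs exceeds `tolerance`."""
    alerts = {}
    for metric in previous.keys() & current.keys():
        delta = current[metric] - previous[metric]
        if abs(delta) > tolerance:
            alerts[metric] = round(delta, 4)
    return alerts

# Hypothetical per-column metrics from two profiling runs
run_monday = {"email.null_rate": 0.02, "email.distinct_ratio": 0.98}
run_tuesday = {"email.null_rate": 0.15, "email.distinct_ratio": 0.97}
alerts = compare_profiles(run_monday, run_tuesday)
print(alerts)
```

Only metrics present in both runs are compared, so added or dropped columns would need a separate check.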
### Collaborative Profiling Platform
- **Business validation** with stakeholder review and feedback capabilities
- **Domain expert input** for interpreting profiling results and patterns
- **Annotation system** for documenting business context and explanations
- **Knowledge sharing** across teams and profiling initiatives
## Outputs
### Primary Deliverable
- **Data Profiling Report** (`data-profiling-report.md`)
  - Comprehensive analysis of data characteristics and quality patterns
  - Statistical summaries and distribution analysis
  - Business rule discoveries and validation requirements
  - Quality improvement recommendations and action items
### Supporting Artifacts
- **Profiling Dataset** - Detailed profiling results with metadata and statistics
- **Quality Assessment** - Data quality scoring and issue identification
- **Business Rule Catalog** - Discovered business rules and validation requirements
- **Profiling Methodology** - Documentation of profiling approach and assumptions
## Success Criteria
### Profiling Completeness and Accuracy
- **Complete Coverage**: All critical data sources and fields profiled comprehensively
- **Statistical Accuracy**: Profiling results statistically representative of data populations
- **Business Relevance**: Profiling insights directly applicable to business decisions
- **Quality Insights**: Clear identification of data quality issues and improvement opportunities
- **Actionable Recommendations**: Specific, implementable recommendations for data improvement
### Validation Requirements
- [ ] Profiling covers all critical data sources and business-relevant fields
- [ ] Statistical analysis provides representative insights with appropriate confidence levels
- [ ] Business stakeholders validate profiling results and interpretations
- [ ] Quality issues identified with specific remediation recommendations
- [ ] Profiling methodology documented and replicable for future analysis
- [ ] Profiling results integrated with data governance and quality frameworks
### Evidence Collection
- Profiling execution logs and methodology documentation
- Statistical validation of profiling accuracy and representativeness
- Business stakeholder validation and feedback on profiling results
- Quality issue documentation with evidence and impact assessment
- Recommendation validation and implementation planning documentation
## Data Profiling Dimensions
### Structural Profiling
- **Schema Analysis**: Data types, field definitions, constraints, relationships
- **Volume Analysis**: Record counts, field populations, data growth patterns
- **Relationship Analysis**: Foreign keys, hierarchies, dependencies
- **Metadata Analysis**: Documentation, lineage, source system characteristics
### Content Profiling
- **Value Analysis**: Distributions, ranges, patterns, uniqueness
- **Format Analysis**: Data formats, encoding, standardization
- **Completeness Analysis**: Null rates, missing patterns, population coverage
- **Consistency Analysis**: Cross-system comparisons, standardization assessment
### Quality Profiling
- **Accuracy Assessment**: Format validation, range checks, business rule compliance
- **Completeness Assessment**: Missing value analysis and impact evaluation
- **Consistency Assessment**: Cross-system and temporal consistency analysis
- **Validity Assessment**: Data type compliance and constraint validation
### Business Profiling
- **Business Rule Discovery**: Implicit rules and constraints identification
- **Domain Analysis**: Business entity relationships and hierarchies
- **Usage Pattern Analysis**: Access patterns and business value assessment
- **Stakeholder Analysis**: Data ownership and stewardship identification
## Profiling Techniques and Methods
### Automated Profiling Techniques
- **Statistical Profiling**: Automated calculation of descriptive statistics
- **Pattern Recognition**: Automated identification of data patterns and formats
- **Anomaly Detection**: Machine learning-based identification of unusual patterns
- **Rule Discovery**: Automated discovery of data relationships and constraints
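A common way to implement pattern recognition is to reduce each value to a character-class mask and count the masks. The sketch below assumes ASCII input and the convention of `9` for digits and `A` for letters; both the convention and the function names are illustrative.

```python
import re
from collections import Counter

def value_mask(value):
    """Replace digits with 9 and letters with A, keeping punctuation as-is."""
    masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", masked)

def pattern_frequencies(values):
    """Count how often each mask occurs in a column."""
    return Counter(value_mask(v) for v in values)

# Hypothetical phone number column
phone_numbers = ["555-0100", "555-0101", "5550102", "call me"]
patterns = pattern_frequencies(phone_numbers)
print(patterns.most_common())
```

Rare masks, such as `AAAA AA` in this sample, point to values that do not follow the dominant format.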
### Manual Profiling Techniques
- **Expert Review**: Domain expert analysis of profiling results
- **Business Validation**: Stakeholder review and interpretation of patterns
- **Quality Assessment**: Manual validation of automated profiling results
- **Context Analysis**: Business context integration with profiling insights
### Sampling Strategies
- **Random Sampling**: Statistically representative sampling for large datasets
- **Stratified Sampling**: Ensuring representation across different data segments
- **Temporal Sampling**: Time-based sampling for trend and seasonality analysis
- **Purposive Sampling**: Targeted sampling for specific business questions
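The difference between random and stratified sampling can be illustrated with the standard library. This sketch assumes the dataset fits in memory and uses a fixed seed for repeatability; `stratified_sample` and the `region` field are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, strata_field, fraction, seed=0):
    """Sample the same fraction from each stratum, keeping at least one row."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_field]].append(row)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical dataset: 90 EU rows and 10 APAC rows
rows = [{"id": i, "region": "EU" if i % 10 else "APAC"} for i in range(100)]

simple = random.Random(0).sample(rows, 10)          # simple random sample
stratified = stratified_sample(rows, "region", 0.1)  # 10% from each region
print(len(simple), len(stratified))
```

The simple random sample may contain no APAC rows at all, whereas the stratified sample always includes the minority segment.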
### Profiling Tools and Technologies
- **Open Source Tools**: Apache Griffin, DataCleaner, OpenRefine
- **Commercial Tools**: Talend Data Quality, Informatica Data Quality, IBM InfoSphere Information Analyzer
- **Cloud Native Tools**: AWS Glue DataBrew, Azure Data Factory, Google Cloud Dataprep
- **Custom Solutions**: Python/R scripts, SQL-based profiling, Spark applications
## Profiling Analysis Framework
### Statistical Analysis
- **Descriptive Statistics**: Mean, median, mode, standard deviation, percentiles
- **Distribution Analysis**: Histograms, frequency distributions, outlier identification
- **Correlation Analysis**: Relationships between fields and data dependencies
- **Trend Analysis**: Temporal patterns and seasonal variations
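Descriptive statistics and outlier identification can be combined in a few lines. The sketch below uses Tukey's fences (1.5 times the interquartile range), which is one common outlier rule and an assumption here rather than a method prescribed by this task; the sample amounts are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical order amounts with one suspicious entry
order_amounts = [12, 15, 14, 13, 16, 15, 14, 250]
summary = {
    "mean": statistics.mean(order_amounts),
    "median": statistics.median(order_amounts),
    "stdev": statistics.stdev(order_amounts),
}
outliers = iqr_outliers(order_amounts)
print(summary)
print(outliers)
```

A large gap between mean and median, as in this sample, is itself a hint that the distribution is skewed by extreme values.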
### Quality Analysis
- **Completeness Analysis**: Missing value patterns and population coverage
- **Accuracy Analysis**: Format compliance and business rule validation
- **Consistency Analysis**: Cross-system and temporal consistency assessment
- **Uniqueness Analysis**: Duplicate detection and cardinality assessment
### Business Analysis
- **Value Analysis**: Business value and critical data identification
- **Usage Analysis**: Data access patterns and business importance
- **Risk Analysis**: Data quality risks and business impact assessment
- **Opportunity Analysis**: Data enhancement and value creation opportunities
## Validation Framework
### Profiling Quality Assurance
1. **Sampling Validation**: Ensure profiling samples are representative
2. **Statistical Validation**: Verify statistical accuracy and confidence levels
3. **Business Validation**: Confirm profiling results align with business knowledge
4. **Technical Validation**: Validate profiling methodology and tool accuracy
5. **Completeness Validation**: Ensure all critical aspects are profiled
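For statistical validation, a metric measured on a sample can be reported with a confidence interval instead of a point value. The sketch below uses the normal-approximation (Wald) interval, which is a simplifying assumption that works poorly for very small samples or rates near 0 or 1; the counts are hypothetical.

```python
import math

def proportion_interval(successes, sample_size, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 is ~95%."""
    p = successes / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical: 40 nulls observed in a 1,000-row sample
low, high = proportion_interval(40, 1000)
print(f"null rate between {low:.3f} and {high:.3f}")
```

If the interval is too wide to support a decision, the sample size should be increased before the profiling result is accepted.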
### Continuous Profiling Management
- Regular profiling execution to track data evolution
- Profiling result comparison and trend analysis
- Integration with data quality monitoring and governance
- Feedback collection for profiling methodology improvement
## Best Practices
### Profiling Planning
- Clearly define profiling objectives and success criteria
- Select representative data samples for accurate profiling
- Consider privacy and security requirements during profiling
- Plan for scalability and performance in large dataset profiling
### Analysis and Interpretation
- Combine automated profiling with business domain expertise
- Validate profiling results with multiple data sources and methods
- Focus on business-relevant insights and actionable recommendations
- Document assumptions and limitations of profiling analysis
### Stakeholder Engagement
- Involve business stakeholders in profiling planning and validation
- Communicate profiling results in business-relevant terms
- Create actionable recommendations with clear business value
- Establish feedback loops for continuous profiling improvement
## Risk Mitigation
### Common Pitfalls
- **Sampling Bias**: Unrepresentative samples leading to incorrect conclusions
- **Analysis Paralysis**: Over-analysis without actionable insights
- **Privacy Violations**: Inadequate protection of sensitive data during profiling
- **Tool Limitations**: Relying on a single tool without cross-validating its results
### Success Factors
- Clear objectives and success criteria for profiling initiatives
- Representative sampling strategies appropriate for data characteristics
- Business stakeholder involvement in profiling planning and validation
- Multiple profiling approaches and tools for validation and completeness
- Integration with broader data governance and quality management initiatives
## Notes
Data profiling is fundamental to understanding data characteristics and developing effective quality frameworks. Invest in comprehensive profiling that combines automated analysis with business domain expertise to generate actionable insights for data improvement and governance initiatives.