agentic-data-stack-community
AI Agentic Data Stack Framework - Community Edition. Open source data engineering framework with 4 core agents, essential templates, and 3-dimensional quality validation.
# Data Quality Improvement Workflow
# Improving data quality in existing simple systems for small organizations
metadata:
workflow_id: data-quality-improvement
version: "1.0.0"
category: simple_brownfield
complexity: beginner
timeline: "2-3 weeks"
effort_hours: "35-55 hours"
risk_level: low
cost_estimate: "$0-75/month"
prerequisites:
- Access to existing data systems
- Understanding of business processes
- Basic data analysis skills
- Authority to implement process changes
target_audience:
- Organizations struggling with inconsistent data
- Teams wasting time on data cleanup
- Businesses with unreliable reporting due to poor data quality
description: |
Systematically improve data quality in existing simple systems without major
system changes. This workflow helps organizations identify, clean, and prevent
data quality issues using practical, low-risk approaches that can be implemented
by non-technical staff.
business_value:
primary_benefits:
- Increase confidence in business reports and decisions
- Reduce time spent on manual data cleanup
- Eliminate errors caused by inconsistent data
- Improve customer satisfaction with accurate information
- Enable better business intelligence and analytics
- Reduce operational costs from data-related mistakes
roi_metrics:
- Data cleanup time: 70% reduction in manual data cleaning
- Decision accuracy: Improved with reliable data foundation
- Customer satisfaction: Fewer errors in customer communications
- Operational efficiency: Reduced time spent investigating data discrepancies
phases:
assessment:
duration: "4-5 days"
description: "Identify and catalog data quality issues"
tasks:
- name: "Data quality audit"
duration: "2-3 days"
owner: "Data Quality Analyst"
deliverables:
- data_quality_assessment.xlsx
- issue_catalog.md
- impact_analysis.xlsx
assessment_framework:
data_quality_dimensions:
- name: "Completeness"
description: "Are all required fields filled in?"
common_issues:
- Missing customer contact information
- Incomplete product descriptions
- Empty required fields in forms
impact: "Prevents effective communication and analysis"
- name: "Accuracy"
description: "Is the data correct and up-to-date?"
common_issues:
- Outdated customer addresses
- Wrong product prices
- Incorrect calculations in spreadsheets
impact: "Leads to poor decisions and customer dissatisfaction"
- name: "Consistency"
description: "Is the same information represented the same way?"
common_issues:
- Company names spelled differently
- Product codes in various formats
- Dates in different formats
impact: "Makes reporting and analysis difficult"
- name: "Validity"
description: "Does the data conform to expected formats?"
common_issues:
- Invalid email addresses
- Phone numbers in wrong format
- Negative quantities or prices
impact: "Causes system errors and communication failures"
- name: "Timeliness"
description: "Is the data current and up-to-date?"
common_issues:
- Outdated inventory levels
- Old contact information
- Stale pricing information
impact: "Leads to operational problems and customer issues"
audit_process:
data_sampling:
- Select representative samples from each data source
- Use random sampling for large datasets
- Include edge cases and recent entries
- Document sampling methodology
quality_checks:
- Check for missing values in critical fields
- Identify duplicate records
- Validate data formats and types
- Look for outliers and unusual values
- Compare related data for consistency
issue_documentation:
- Catalog each type of quality issue found
- Estimate frequency and severity of problems
- Document business impact for each issue type
- Prioritize issues by impact and ease of fixing
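# Illustrative sketch (assumed, not part of this spec): the completeness, duplicate, and
# outlier checks above can be run against a CSV export with a few lines of pandas.
# Column names such as "email" and "unit_price" are placeholders.
#
#   import pandas as pd
#
#   df = pd.read_csv("customers_export.csv")
#
#   # Completeness: share of missing values in each critical field
#   missing_pct = df[["name", "email", "phone"]].isna().mean().mul(100).round(1)
#
#   # Duplicates: rows that repeat on the fields that should identify a record
#   dup_count = df.duplicated(subset=["name", "email"], keep=False).sum()
#
#   # Validity: rough format check for email addresses
#   bad_email = ~df["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
#
#   # Outliers: unit prices more than 3 standard deviations from the mean
#   z = (df["unit_price"] - df["unit_price"].mean()) / df["unit_price"].std()
#   outliers = df[z.abs() > 3]
#
#   print(missing_pct, dup_count, bad_email.sum(), len(outliers), sep="\n")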
- name: "Root cause analysis"
duration: "1-2 days"
owner: "Process Analyst"
deliverables:
- root_cause_analysis.md
- process_flow_diagrams.pdf
- improvement_opportunities.xlsx
common_root_causes:
data_entry_processes:
- Lack of data entry standards
- No validation at point of entry
- Manual typing errors
- Copy-paste mistakes
- Inadequate training on data entry procedures
system_limitations:
- No built-in validation rules
- Inability to enforce required fields
- Poor user interface design
- Lack of automated data checks
- No duplicate detection mechanisms
organizational_issues:
- Unclear data ownership responsibilities
- No data quality standards or procedures
- Insufficient time allocated for quality data entry
- Lack of consequences for poor data quality
- No regular data quality monitoring
analysis_techniques:
process_mapping:
- Map current data entry and maintenance processes
- Identify points where errors are introduced
- Document approval and verification steps
- Note where automation could help
stakeholder_interviews:
- Interview data entry staff about challenges
- Understand time pressures and constraints
- Identify training gaps and needs
- Gather suggestions for improvements
- name: "Create improvement plan"
duration: "1 day"
owner: "Project Manager"
deliverables:
- data_quality_improvement_plan.xlsx
- quick_wins_identification.md
- resource_requirements.md
improvement_strategies:
immediate_fixes:
- Clean existing data using standardization rules
- Remove obvious duplicates
- Fill in missing critical information
- Correct known errors and inconsistencies
prevention_measures:
- Implement data entry validation
- Create standardized procedures
- Provide training on data quality importance
- Set up regular monitoring and reporting
long_term_improvements:
- Automate data validation where possible
- Integrate systems to reduce manual entry
- Implement data governance procedures
- Create data quality metrics and reporting
data_cleaning:
duration: "1-2 weeks"
description: "Clean existing data and fix current quality issues"
tasks:
- name: "Standardize data formats"
duration: "3-4 days"
owner: "Data Cleaning Specialist"
deliverables:
- standardization_rules.xlsx
- cleaned_data_files
- cleaning_log.md
standardization_targets:
contact_information:
phone_numbers:
standard_format: "(555) 123-4567"
cleaning_rules:
- Remove all non-numeric characters except parentheses and hyphens
- Add area code if missing (use business location default)
- Format consistently with parentheses and hyphens
- Flag international numbers for manual review
email_addresses:
validation_rules:
- Must contain exactly one @ symbol
- Must have domain with at least one dot
- Convert to lowercase
- Trim whitespace
- Flag suspicious patterns for review
addresses:
standardization_approach:
- Use postal service abbreviations for states
- Standardize street type abbreviations (St, Ave, Blvd)
- Capitalize properly (Title Case)
- Remove extra spaces and punctuation
- Validate ZIP codes against known ranges
business_data:
company_names:
consistency_rules:
- Standardize legal entity suffixes (Inc., LLC, Corp.)
- Remove extra spaces and punctuation
- Use consistent capitalization
- Create master list of known variations
product_codes:
format_standards:
- Convert to uppercase
- Remove spaces and special characters
- Pad with zeros to consistent length
- Validate against known product list
dates:
standard_format: "MM/DD/YYYY"
validation_rules:
- Must be valid calendar date
- Cannot be in the future (for historical data)
- Use consistent format across all systems
- Handle common input variations
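# Illustrative sketch (assumed helper, not part of this spec): the phone, email, and date
# rules above expressed as plain-Python normalizers; the default area code is a placeholder.
#
#   import re
#   from datetime import datetime
#
#   DEFAULT_AREA_CODE = "555"  # assumed business-location default
#
#   def clean_phone(raw: str) -> str:
#       digits = re.sub(r"\D", "", raw or "")
#       if len(digits) == 7:                      # add area code if missing
#           digits = DEFAULT_AREA_CODE + digits
#       if len(digits) != 10:                     # flag international / malformed for review
#           return f"REVIEW: {raw}"
#       return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
#
#   def clean_email(raw: str) -> str:
#       email = (raw or "").strip().lower()
#       return email if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) else f"REVIEW: {raw}"
#
#   def clean_date(raw: str) -> str:              # normalize common variants to MM/DD/YYYY
#       for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y", "%m-%d-%Y"):
#           try:
#               return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
#           except (ValueError, AttributeError):
#               continue
#       return f"REVIEW: {raw}"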
cleaning_tools:
excel_techniques:
- TRIM() function to remove extra spaces
- UPPER(), LOWER(), PROPER() for capitalization
- SUBSTITUTE() and Find & Replace for pattern cleaning
- Data validation to prevent future errors
google_sheets_tools:
- Built-in data cleanup suggestions
- REGEX functions for pattern matching
- Conditional formatting to highlight issues
- Data validation rules
specialized_tools:
- OpenRefine for complex data cleaning
- Excel Power Query for automated cleaning
- Simple Python scripts for pattern matching
- Online tools for email and address validation
- name: "Remove duplicates and merge records"
duration: "2-3 days"
owner: "Data Analyst"
deliverables:
- duplicate_identification_report.xlsx
- merge_decisions_log.md
- cleaned_master_files
duplicate_detection:
identification_strategies:
exact_matches:
- Identical values in key fields
- Same name, address, and phone number
- Identical product codes or IDs
fuzzy_matches:
- Similar company names with slight variations
- Same person with nickname vs full name
- Address variations (Ave vs Avenue)
- Phone numbers with different formatting
detection_techniques:
manual_methods:
- Sort data by key fields to spot duplicates
- Use conditional formatting to highlight potential matches
- Create pivot tables to count occurrences
- Visual inspection of suspicious records
automated_methods:
- Use Excel's Remove Duplicates feature
- VLOOKUP or INDEX/MATCH to find matches
- Conditional formatting with COUNTIF functions
- Fuzzy matching using SOUNDEX or similar
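# Illustrative sketch (assumed, not part of this spec): fuzzy-matching company names with the
# standard-library difflib module, one alternative to SOUNDEX-style matching. Names and the
# 0.8 threshold are placeholders for manual tuning.
#
#   from difflib import SequenceMatcher
#   from itertools import combinations
#
#   names = ["Acme Supplies Inc.", "Acme Supplies, Inc", "Apex Tools LLC"]
#
#   def similarity(a: str, b: str) -> float:
#       return SequenceMatcher(None, a.lower(), b.lower()).ratio()
#
#   # Pairs scoring above 0.8 are candidate duplicates for manual review
#   candidates = [(a, b, round(similarity(a, b), 2))
#                 for a, b in combinations(names, 2)
#                 if similarity(a, b) > 0.8]
#   print(candidates)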
merge_procedures:
decision_criteria:
- Most complete record wins (fewer empty fields)
- Most recent information takes precedence
- Higher quality data source preferred
- Business user judgment for complex cases
merge_process:
- Compare all fields between duplicate records
- Combine information to create single complete record
- Document merge decisions for audit trail
- Archive duplicate records before deletion
- Update references in related records
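# Illustrative sketch (assumed, not part of this spec): applying the "most complete record
# wins" rule with pandas by keeping, within each duplicate group, the row with the fewest
# empty fields. Grouping on name + email is an assumption; archived rows keep the audit trail.
#
#   import pandas as pd
#
#   df = pd.read_csv("customers_export.csv")
#   df["_filled"] = df.notna().sum(axis=1)                   # completeness score per row
#   survivors = (df.sort_values("_filled", ascending=False)  # most complete row first
#                  .drop_duplicates(subset=["name", "email"], keep="first")
#                  .drop(columns="_filled"))
#   df[~df.index.isin(survivors.index)].to_csv("archived_duplicates.csv", index=False)
#   survivors.to_csv("customers_deduplicated.csv", index=False)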
- name: "Fill missing data"
duration: "2-3 days"
owner: "Data Research Specialist"
deliverables:
- missing_data_analysis.xlsx
- data_completion_log.md
- updated_complete_records
missing_data_strategies:
research_and_lookup:
customer_information:
- Look up company websites for contact details
- Use LinkedIn for employee information
- Check business directories for addresses
- Call customers to verify/update information
product_information:
- Research manufacturer websites
- Look up product specifications online
- Check with suppliers for missing details
- Use industry catalogs and databases
inference_and_calculation:
derived_values:
- Calculate totals from line items
- Infer categories from product descriptions
- Estimate dates from related information
- Use business rules to fill standard values
default_values:
- Use most common values for categories
- Apply standard business terms
- Set default status values
- Use regional defaults for geographic data
controlled_blanking:
acceptable_gaps:
- Mark as "Unknown" where research isn't feasible
- Use "N/A" for inapplicable fields
- Leave optional fields empty rather than guess
- Document what cannot be completed
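# Illustrative sketch (assumed, not part of this spec): simple derived-value, default-value,
# and controlled-blanking rules applied with pandas; column names and defaults are placeholders.
#
#   import pandas as pd
#
#   df = pd.read_csv("products_export.csv")
#
#   # Derived values: recompute totals from line items where the total is missing
#   df["total"] = df["total"].fillna(df["quantity"] * df["unit_price"])
#
#   # Default values: most common category, standard status
#   df["category"] = df["category"].fillna(df["category"].mode().iloc[0])
#   df["status"] = df["status"].fillna("Active")
#
#   # Controlled blanking: mark what could not be researched rather than guessing
#   df["supplier"] = df["supplier"].fillna("Unknown")
#   df.to_csv("products_completed.csv", index=False)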
prevention:
duration: "3-5 days"
description: "Implement processes to prevent future data quality issues"
tasks:
- name: "Create data entry standards"
duration: "2 days"
owner: "Process Designer"
deliverables:
- data_entry_standards_manual.pdf
- field_specifications.xlsx
- validation_rules_documentation.md
standards_development:
field_level_standards:
required_fields:
- Identify fields that must never be empty
- Define what constitutes valid data for each field
- Create clear examples of acceptable entries
- Document business rules for special cases
format_specifications:
- Phone numbers: (999) 999-9999
- Dates: MM/DD/YYYY
- Currency: 9,999.99 (store numbers only; add the $ symbol for display, not in the data)
- Product codes: ABC-1234 (uppercase, hyphenated)
- Email: all lowercase, validated format
validation_rules:
- Numeric fields: acceptable ranges and constraints
- Text fields: maximum lengths and character restrictions
- Date fields: valid ranges and business rules
- Lookup fields: must match predefined lists
business_rules:
consistency_requirements:
- Customer billing and shipping addresses must match when marked as the same
- Product prices must be consistent with category standards
- Order dates cannot be in the future
- Discount percentages cannot exceed 100%
relationship_rules:
- Every order must have a valid customer
- All products must belong to valid categories
- Quantities must be positive numbers
- Totals must equal sum of line items
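# Illustrative sketch (assumed, not part of this spec): the field formats and business rules
# above expressed as one record-level check; field names and the code pattern are placeholders.
#
#   import re
#   from datetime import date, datetime
#
#   def validate_order(rec: dict) -> list[str]:
#       errors = []
#       if not rec.get("customer_id"):
#           errors.append("missing customer")
#       if not re.fullmatch(r"[A-Z]{3}-\d{4}", rec.get("product_code", "")):
#           errors.append("product code must look like ABC-1234")
#       if rec.get("quantity", 0) <= 0:
#           errors.append("quantity must be positive")
#       if not 0 <= rec.get("discount_pct", 0) <= 100:
#           errors.append("discount must be between 0 and 100")
#       order_date = datetime.strptime(rec["order_date"], "%m/%d/%Y").date()
#       if order_date > date.today():
#           errors.append("order date cannot be in the future")
#       return errors
#
#   print(validate_order({"customer_id": "C001", "product_code": "ABC-1234",
#                         "quantity": 2, "discount_pct": 10, "order_date": "01/15/2024"}))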
implementation_approach:
documentation_creation:
- Create simple, visual guides with examples
- Use screenshots and step-by-step instructions
- Include common mistakes and how to avoid them
- Provide quick reference cards for daily use
process_integration:
- Build standards into existing workflows
- Create checklists for data entry tasks
- Design review procedures for quality checking
- Establish escalation procedures for questions
- name: "Implement validation controls"
duration: "2-3 days"
owner: "Technical Implementer"
deliverables:
- validation_controls_implementation
- testing_results.xlsx
- user_guidance_documentation.md
validation_techniques:
spreadsheet_controls:
excel_data_validation:
- Drop-down lists for categories and statuses
- Number ranges for quantities and prices
- Date ranges for realistic date entry
- Text length limits for descriptions
- Custom formulas for business rule validation
conditional_formatting:
- Highlight cells with invalid data
- Color-code required fields
- Show warnings for unusual values
- Indicate incomplete records
formula_protection:
- Lock cells with calculations
- Protect worksheet structure
- Allow data entry only in designated cells
- Create templates with built-in validation
database_controls:
field_constraints:
- Required field settings
- Data type enforcement
- Check constraints for valid ranges
- Foreign key relationships for consistency
input_forms:
- Design forms with built-in validation
- Use dropdown lists for standardized entries
- Implement real-time validation feedback
- Create user-friendly error messages
simple_automation:
auto_formatting:
- Automatically format phone numbers
- Convert text to proper case
- Standardize date formats on entry
- Generate IDs and codes automatically
duplicate_prevention:
- Check for existing records before creating new ones
- Warn users about potential duplicates
- Require confirmation for similar entries
- Auto-populate based on existing data
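# Illustrative sketch (assumed, not part of this spec): building a spreadsheet template with
# the drop-down and number-range controls described above, using the openpyxl library.
# Column layout, list values, and ranges are placeholders.
#
#   from openpyxl import Workbook
#   from openpyxl.worksheet.datavalidation import DataValidation
#
#   wb = Workbook()
#   ws = wb.active
#   ws.append(["Customer", "Status", "Quantity"])
#
#   status_dv = DataValidation(type="list", formula1='"Active,Inactive,Pending"',
#                              allow_blank=False, showErrorMessage=True)
#   qty_dv = DataValidation(type="whole", operator="between", formula1="1", formula2="1000",
#                           showErrorMessage=True)
#   ws.add_data_validation(status_dv)
#   ws.add_data_validation(qty_dv)
#   status_dv.add("B2:B500")    # drop-down list for the Status column
#   qty_dv.add("C2:C500")       # quantities must be whole numbers between 1 and 1000
#   wb.save("data_entry_template.xlsx")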
- name: "Train staff on data quality"
duration: "1-2 days"
owner: "Training Coordinator"
deliverables:
- training_materials.pdf
- training_session_recordings.mp4
- data_quality_awareness_quiz.pdf
training_components:
data_quality_importance:
business_impact:
- How poor data affects customer service
- Impact on decision-making and reporting
- Cost of fixing vs preventing data problems
- Examples from organization's experience
personal_responsibility:
- Each person's role in maintaining quality
- How data quality affects their own work
- Pride in creating accurate, useful data
- Recognition for quality improvement efforts
practical_skills:
data_entry_techniques:
- Using validation tools effectively
- Double-checking critical information
- When and how to research missing data
- Proper use of copy-paste to avoid errors
quality_checking:
- How to spot potential data problems
- Simple techniques for verifying accuracy
- When to ask questions vs making assumptions
- How to report and escalate data issues
training_delivery:
session_format:
- Short, interactive sessions (30-45 minutes)
- Real examples from actual organizational data
- Hands-on practice with validation tools
- Q&A and discussion of specific challenges
ongoing_reinforcement:
- Monthly data quality tips
- Recognition for good data quality practices
- Regular reminders about standards
- Feedback on data quality improvements
monitoring:
duration: "2-3 days"
description: "Set up ongoing monitoring and quality reporting"
tasks:
- name: "Create quality metrics and dashboards"
duration: "1-2 days"
owner: "Reporting Specialist"
deliverables:
- data_quality_dashboard.xlsx
- quality_metrics_definitions.md
- monitoring_procedures.md
key_metrics:
completeness_metrics:
- "Percentage of records with all required fields filled"
- "Count of empty critical fields by data source"
- "Trend of completeness over time"
- "Records created without required information"
accuracy_metrics:
- "Percentage of records validated against external sources"
- "Count of data corrections made per month"
- "Customer complaints related to data accuracy"
- "System errors due to invalid data"
consistency_metrics:
- "Records with standardized formatting"
- "Duplicate records identified and resolved"
- "Variance in similar data across sources"
- "Adherence to data entry standards"
timeliness_metrics:
- "Average age of data when used"
- "Frequency of data updates"
- "Time between data changes and system updates"
- "Percentage of current vs outdated records"
dashboard_design:
visual_elements:
- Traffic light indicators for quality levels
- Trend charts showing improvement over time
- Top issues requiring attention
- Quality scores by data source or team
automated_features:
- Daily/weekly automatic updates
- Alert notifications for quality issues
- Drill-down capabilities for detailed analysis
- Export capabilities for sharing with management
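# Illustrative sketch (assumed, not part of this spec): computing a few of the metrics above
# for the dashboard; column names and the 365-day staleness threshold are placeholders.
#
#   import pandas as pd
#
#   df = pd.read_csv("customers_export.csv", parse_dates=["last_updated"])
#   required = ["name", "email", "phone"]
#
#   metrics = {
#       "completeness_pct": round(df[required].notna().all(axis=1).mean() * 100, 1),
#       "duplicate_records": int(df.duplicated(subset=["name", "email"]).sum()),
#       "stale_records_pct": round(
#           (df["last_updated"] < pd.Timestamp.now() - pd.Timedelta(days=365)).mean() * 100, 1),
#   }
#   pd.DataFrame([metrics]).to_csv("quality_metrics_latest.csv", index=False)
#   print(metrics)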
- name: "Establish quality monitoring procedures"
duration: "1 day"
owner: "Quality Manager"
deliverables:
- monitoring_schedule.xlsx
- quality_review_procedures.md
- escalation_guidelines.md
monitoring_schedule:
daily_checks:
tasks:
- Review new data entries for completeness
- Check for obvious errors or inconsistencies
- Monitor validation rule violations
- Address immediate quality issues
duration: "15-20 minutes"
owner: "Data Entry Supervisor"
weekly_reviews:
tasks:
- Analyze quality metrics and trends
- Review top data quality issues
- Plan corrective actions for problems
- Communicate quality status to stakeholders
duration: "1-2 hours"
owner: "Data Quality Coordinator"
monthly_assessments:
tasks:
- Comprehensive quality audit of sample data
- Review and update data quality standards
- Analyze root causes of recurring issues
- Plan improvements and training needs
duration: "4-6 hours"
owner: "Quality Manager"
quality_procedures:
issue_identification:
detection_methods:
- Automated alerts from validation rules
- User reports of suspected problems
- Regular sampling and checking
- System error logs and exceptions
response_procedures:
immediate_response:
- Investigate reported issues within 24 hours
- Correct critical errors immediately
- Document issues and resolutions
- Communicate fixes to affected users
systematic_improvement:
- Analyze patterns in quality issues
- Update validation rules and procedures
- Provide additional training if needed
- Modify processes to prevent recurrence
success_metrics:
quantitative:
- "Data completeness improved to 95%+ for critical fields"
- "Duplicate records reduced by 90%+"
- "Data entry errors reduced by 80%+"
- "Time spent on data cleanup reduced by 70%+"
qualitative:
- "Increased confidence in business reports and decisions"
- "Reduced frustration with data-related problems"
- "Improved customer satisfaction due to accurate information"
- "Better team morale from working with quality data"
common_challenges:
legacy_data_issues:
challenge: "Years of poor quality data accumulated in systems"
solution: "Focus on cleaning most critical data first, improve gradually"
user_resistance:
challenge: "Staff resist additional validation steps that slow them down"
solution: "Show benefits, start with easy wins, provide adequate training"
resource_constraints:
challenge: "Limited time and budget for data quality improvements"
solution: "Focus on high-impact, low-cost improvements first"
ongoing_maintenance:
challenge: "Data quality degrades without continuous attention"
solution: "Build monitoring and maintenance into regular business processes"
tools_and_resources:
data_cleaning_tools:
- Excel with advanced formulas and functions
- Google Sheets with built-in data cleanup features
- OpenRefine for complex data transformation
- Online validation services for emails and addresses
validation_tools:
- Spreadsheet data validation features
- Simple database constraints and rules
- Form-based input with validation
- Automated formatting and standardization
monitoring_tools:
- Dashboard creation in Excel or Google Sheets
- Simple database reporting features
- Email alerts for critical quality issues
- Manual sampling and checking procedures
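# Illustrative sketch (assumed, not part of this spec): a minimal email alert when a quality
# metric crosses a threshold; the SMTP host, addresses, and threshold are all placeholders.
#
#   import smtplib
#   from email.message import EmailMessage
#
#   def send_quality_alert(completeness_pct: float, threshold: float = 95.0) -> None:
#       if completeness_pct >= threshold:
#           return
#       msg = EmailMessage()
#       msg["Subject"] = f"Data quality alert: completeness at {completeness_pct:.1f}%"
#       msg["From"] = "alerts@example.com"
#       msg["To"] = "data-quality@example.com"
#       msg.set_content("Completeness fell below the agreed threshold; see the dashboard.")
#       with smtplib.SMTP("smtp.example.com", 587) as server:
#           server.starttls()
#           server.send_message(msg)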
rollback_plan:
triggers:
- Data cleaning introduces new errors or corruption
- Validation rules prevent legitimate business operations
- Staff productivity significantly impacted by new procedures
- Quality improvements don't justify the effort and cost
rollback_procedure:
- Restore backup copies of original data
- Remove or modify problematic validation rules
- Return to previous data entry procedures
- Analyze what went wrong and plan alternative approach
- Communicate changes to all affected staff
next_steps:
continuous_improvement:
- Gradually increase validation sophistication
- Automate more quality checks and corrections
- Integrate with additional business systems
- Develop advanced reporting and analytics capabilities
expansion_opportunities:
- Apply quality improvements to additional data sources
- Implement master data management practices
- Create data governance policies and procedures
- Develop organization-wide data quality culture