# Data Profiling Task
This task guides the systematic analysis and profiling of Salesforce data to
understand quality, patterns, and anomalies for informed data management
decisions.
## Purpose
Enable data quality analysts to:
- Assess data quality across Salesforce objects
- Identify data patterns and distributions
- Detect anomalies and outliers
- Establish data quality baselines
- Generate actionable insights for data improvement
## Prerequisites
- Access to Salesforce data and reporting tools
- Understanding of data quality dimensions
- Statistical analysis knowledge
- Data visualization tools access
- Stakeholder requirements and objectives
## Data Profiling Framework
### 1. Data Quality Dimensions
**Core Quality Metrics**
```yaml
Completeness:
  Definition: Percentage of non-null values
  Calculation: (Non-null count / Total count) * 100
  Target: ">95% for critical fields"
  Impact: High - affects business processes

Accuracy:
  Definition: Correctness of data values
  Validation: Format checks, range validation
  Target: ">99% for master data"
  Impact: Critical - drives decision making

Consistency:
  Definition: Data uniformity across systems
  Checks: Cross-reference validation
  Target: ">98% matching rate"
  Impact: Medium - affects integration

Uniqueness:
  Definition: Absence of duplicate records
  Detection: Fuzzy matching algorithms
  Target: "<1% duplicate rate"
  Impact: High - data integrity

Timeliness:
  Definition: Data currency and freshness
  Measurement: Last modified timestamps
  Target: "<24 hours for operational data"
  Impact: Medium - business relevance

Validity:
  Definition: Data conforms to defined formats
  Validation: Regex patterns, business rules
  Target: ">99% valid format"
  Impact: High - system functionality
```
### 2. Profiling Analysis Types
**Statistical Profiling**
```
Algorithm: Completeness Analysis
INPUT: object_name, field_name
PROCESS:
1. COUNT total_records in object
2. COUNT non_null_records in field_name
3. CALCULATE null_count = total_records - non_null_records
4. CALCULATE completeness_percentage = (non_null_records / total_records) * 100
5. ROUND completeness_percentage to 2 decimal places
6. RETURN analysis_results
OUTPUT: completeness_metrics
```
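As a concrete illustration, the completeness algorithm above can be sketched in Python over a list of already-fetched records (dicts keyed by field name); the sample Account records are hypothetical, and empty strings are treated as nulls here, which is an assumption rather than part of the algorithm:

```python
def completeness_metrics(records, field_name):
    """Completeness statistics for one field over fetched records.
    Treats both None and '' as null (an assumption of this sketch)."""
    total_records = len(records)
    non_null = sum(1 for r in records if r.get(field_name) not in (None, ''))
    null_count = total_records - non_null
    pct = round((non_null / total_records) * 100, 2) if total_records else 0.0
    return {
        'total_records': total_records,
        'non_null_records': non_null,
        'null_count': null_count,
        'completeness_percentage': pct,
    }

# Hypothetical Account records
accounts = [
    {'Name': 'Acme', 'Phone': '555-0100'},
    {'Name': 'Globex', 'Phone': None},
    {'Name': 'Initech', 'Phone': '555-0199'},
    {'Name': 'Umbrella', 'Phone': ''},
]
print(completeness_metrics(accounts, 'Phone'))
# → {'total_records': 4, 'non_null_records': 2, 'null_count': 2, 'completeness_percentage': 50.0}
```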
```
Algorithm: Value Distribution Analysis
INPUT: object_name, field_name
PROCESS:
1. GROUP records by field_name values (exclude null)
2. COUNT record_count for each distinct value
3. CALCULATE total_records in object
4. FOR each distinct value:
   CALCULATE percentage = (record_count / total_records) * 100
5. ORDER results by record_count descending
6. RETURN distribution_analysis
OUTPUT: value_distribution_metrics
```
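The distribution algorithm maps naturally onto `collections.Counter`; a minimal sketch over already-fetched records (the Lead data is hypothetical):

```python
from collections import Counter

def value_distribution(records, field_name):
    """Distribution of non-null values for a field, ordered by frequency.
    Percentages are relative to the total record count, nulls included."""
    total = len(records)
    counts = Counter(r[field_name] for r in records if r.get(field_name) is not None)
    return [
        {'value': v, 'count': c, 'percentage': round(c / total * 100, 2)}
        for v, c in counts.most_common()  # already ordered by count descending
    ]

# Hypothetical Lead records
leads = [{'LeadSource': s} for s in ['Web', 'Web', 'Web', 'Referral', None]]
print(value_distribution(leads, 'LeadSource'))
```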
```
Algorithm: Data Range Analysis
INPUT: object_name, numeric_field_name
PROCESS:
1. FILTER records where numeric_field_name is not null
2. CALCULATE min_value, max_value, average_value
3. CALCULATE median_value (50th percentile)
4. CALCULATE standard_deviation
5. COMPILE range_statistics
6. RETURN statistical_analysis
OUTPUT: numeric_field_statistics
```
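The range analysis can be sketched with the standard-library `statistics` module, assuming null values have already been filtered out of the input list:

```python
import statistics

def numeric_field_statistics(values):
    """Range statistics for a numeric field."""
    vals = [v for v in values if v is not None]  # defensive null filter
    return {
        'min_value': min(vals),
        'max_value': max(vals),
        'average_value': statistics.mean(vals),
        'median_value': statistics.median(vals),  # 50th percentile
        'standard_deviation': statistics.stdev(vals) if len(vals) > 1 else 0.0,
    }

# Hypothetical Opportunity amounts
print(numeric_field_statistics([100, 200, 300, 400, 1000]))
```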
**Pattern Analysis**
```yaml
Format_Patterns:
  Phone_Numbers:
    - Pattern: "(xxx) xxx-xxxx"
    - Variations: "+1-xxx-xxx-xxxx", "xxx.xxx.xxxx"
    - Invalid: "123", "call me", "N/A"
  Email_Addresses:
    - Pattern: "user@domain.com"
    - Validation: RFC 5322 compliance
    - Common_Issues: Missing @, invalid domains
  Postal_Codes:
    - US_Pattern: '"12345" or "12345-6789"'
    - International: Country-specific formats
    - Validation: Country-based rules

Naming_Conventions:
  Account_Names:
    - Corporate: '"Company Inc.", "Corporation LLC"'
    - Variations: Abbreviations, legal suffixes
    - Inconsistencies: Case variations, punctuation
```
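The format patterns above can be checked with regular expressions; a minimal sketch, where the patterns are illustrative (the email check in particular is far looser than full RFC 5322 compliance):

```python
import re

# Illustrative, simplified validation patterns - not production-grade
PATTERNS = {
    'phone_us': re.compile(r'^(\+1[-.\s]?)?(\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]?\d{4}$'),
    'email': re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
    'postal_us': re.compile(r'^\d{5}(-\d{4})?$'),
}

def is_valid(kind, value):
    """True if value matches the named format pattern; None counts as invalid."""
    return bool(PATTERNS[kind].match(value or ''))
```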
## Implementation Steps
### Step 1: Data Discovery and Inventory
**Object and Field Analysis**
```
Algorithm: Salesforce Object Discovery
INPUT: salesforce_connection
PROCESS:
1. GET all sobjects from salesforce describe
2. INITIALIZE object_inventory = empty list
3. FOR each object in sobjects:
   IF object is queryable AND name does not end with "__History" THEN
     a. CREATE object_info with:
        - name, label, custom flag
        - record_count from a COUNT query
     b. ADD object_info to object_inventory
4. SORT object_inventory by record_count descending
5. RETURN sorted object_inventory
OUTPUT: discovered_objects_list
```
```
Algorithm: Object Schema Analysis
INPUT: object_name, salesforce_connection
PROCESS:
1. GET describe_result for object_name
2. INITIALIZE field_analysis = empty list
3. FOR each field in describe_result fields:
   a. EXTRACT field metadata:
      - name, type, length
      - required = NOT nillable
      - unique flag
      - picklistValues if applicable
   b. CREATE field_info with extracted metadata
   c. ADD field_info to field_analysis
4. RETURN complete field_analysis
OUTPUT: object_schema_metadata
```
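The schema walk above can be exercised without a live org against the dict shape that a Salesforce `describe()` call returns; the fixture below is a hypothetical, heavily trimmed describe result:

```python
def analyze_schema(describe_result):
    """Extract per-field metadata from a describe() result in dict form."""
    field_analysis = []
    for field in describe_result['fields']:
        info = {
            'name': field['name'],
            'type': field['type'],
            'length': field.get('length'),
            'required': not field['nillable'],  # required = NOT nillable
            'unique': field.get('unique', False),
        }
        if field['type'] == 'picklist':
            info['picklist_values'] = [v['value'] for v in field.get('picklistValues', [])]
        field_analysis.append(info)
    return field_analysis

# Hypothetical trimmed-down describe result for illustration
describe = {'fields': [
    {'name': 'Name', 'type': 'string', 'length': 255, 'nillable': False, 'unique': False},
    {'name': 'Rating', 'type': 'picklist', 'length': 40, 'nillable': True,
     'picklistValues': [{'value': 'Hot'}, {'value': 'Cold'}]},
]}
```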
**Data Volume Assessment**
```
Algorithm: Data Volume Assessment
INPUT: objects_to_analyze, salesforce_connection
PROCESS:
1. INITIALIZE volume_analysis = empty dictionary
2. FOR each object_name in objects_to_analyze:
   a. TRY:
      - EXECUTE count query for total records
      - EXECUTE count query for records modified in last 30 days
      - IF total_count > 0 THEN
          CALCULATE growth_rate = (recent_count / total_count) * 100
        ELSE
          SET growth_rate = 0
      - CALCULATE analysis_priority based on count and growth
      - COMPILE volume_metrics
   b. CATCH exceptions:
      SET error information in results
3. RETURN volume_analysis
OUTPUT: data_volume_assessment
```
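The per-object metric computation can be sketched as a pure function of the two counts; the prioritisation thresholds below are hypothetical, since the framework does not define them:

```python
def volume_metrics(total_count, recent_count):
    """Growth rate plus an analysis priority derived from two record counts.
    The priority thresholds are illustrative assumptions."""
    growth_rate = (recent_count / total_count) * 100 if total_count > 0 else 0
    # Hypothetical prioritisation: large or fast-growing objects first
    if total_count > 100_000 or growth_rate > 10:
        priority = 'high'
    elif total_count > 10_000 or growth_rate > 5:
        priority = 'medium'
    else:
        priority = 'low'
    return {'total_count': total_count, 'recent_count': recent_count,
            'growth_rate': round(growth_rate, 2), 'analysis_priority': priority}
```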
### Step 2: Quality Assessment Implementation
**Completeness Analysis**
```
Algorithm: Multi-Field Completeness Analysis
INPUT: object_name, fields_to_analyze
PROCESS:
1. INITIALIZE completeness_results = empty dictionary
2. BUILD aggregate query to count total records and non-null values for all fields
3. EXECUTE query and get result
4. EXTRACT total_count from result
5. FOR each field in fields_to_analyze:
   a. EXTRACT non_null_count for field
   b. CALCULATE null_count = total_count - non_null_count
   c. IF total_count > 0 THEN
        CALCULATE completeness_percentage = (non_null_count / total_count) * 100
      ELSE
        SET completeness_percentage = 0
   d. CALCULATE quality_score based on completeness_percentage
   e. COMPILE field_completeness_metrics
   f. STORE in completeness_results[field]
6. RETURN completeness_results
OUTPUT: field_completeness_analysis
```
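Step 2 relies on the fact that in SOQL, `COUNT(field)` counts only rows where the field is non-null. Building that single aggregate query can be sketched as plain string construction; the `f0`, `f1` aliases are arbitrary names introduced here:

```python
def build_completeness_query(object_name, fields):
    """Build one aggregate SOQL query that returns the total row count and
    the non-null count per field (COUNT(field) skips nulls in SOQL)."""
    selects = ['COUNT(Id) total'] + [
        f'COUNT({f}) f{i}' for i, f in enumerate(fields)
    ]
    return f"SELECT {', '.join(selects)} FROM {object_name}"

print(build_completeness_query('Contact', ['Email', 'Phone']))
# → SELECT COUNT(Id) total, COUNT(Email) f0, COUNT(Phone) f1 FROM Contact
```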
**Duplicate Detection**
```
Algorithm: Duplicate Record Detection
INPUT: object_name, matching_fields, similarity_threshold (default: 0.8)
PROCESS:
1. BUILD query to extract records with matching_fields and Id
2. EXECUTE query and get all records ordered by CreatedDate
3. INITIALIZE duplicates = empty list, processed_ids = empty set
4. FOR each record1 in records:
   a. IF record1.Id already in processed_ids THEN continue
   b. INITIALIZE potential_duplicates = [record1]
   c. FOR each record2 in remaining records:
      - IF record2.Id in processed_ids THEN continue
      - CALCULATE similarity_score between record1 and record2
      - IF similarity_score >= similarity_threshold THEN
          ADD record2 to potential_duplicates
          ADD record2.Id to processed_ids
   d. IF potential_duplicates count > 1 THEN
        CREATE duplicate_group with group_id, records, count, confidence
        ADD to duplicates list
   e. ADD record1.Id to processed_ids
5. RETURN duplicates
OUTPUT: duplicate_groups_list
```
```
Algorithm: Record Similarity Calculation
INPUT: record1, record2, fields_to_compare
PROCESS:
1. INITIALIZE total_similarity = 0, valid_comparisons = 0
2. FOR each field in fields_to_compare:
   a. GET value1 from record1[field], normalize (trim, lowercase)
   b. GET value2 from record2[field], normalize (trim, lowercase)
   c. IF both value1 and value2 exist THEN
      - CALCULATE similarity using string matching algorithm
      - ADD similarity to total_similarity
      - INCREMENT valid_comparisons
3. IF valid_comparisons > 0 THEN
     RETURN total_similarity / valid_comparisons
   ELSE
     RETURN 0
OUTPUT: similarity_score (0.0 to 1.0)
```
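The similarity calculation can be sketched with the standard library's `difflib.SequenceMatcher` as the string matching algorithm (the task does not mandate a particular one); the sample records are hypothetical:

```python
from difflib import SequenceMatcher

def record_similarity(record1, record2, fields_to_compare):
    """Average string similarity across fields where both records have values.
    Returns 0.0 when no field pair is comparable."""
    total, valid = 0.0, 0
    for field in fields_to_compare:
        v1 = (record1.get(field) or '').strip().lower()  # normalize
        v2 = (record2.get(field) or '').strip().lower()
        if v1 and v2:
            total += SequenceMatcher(None, v1, v2).ratio()
            valid += 1
    return total / valid if valid else 0.0

# Hypothetical Account records
a = {'Name': 'Acme Inc.', 'Phone': '555-0100'}
b = {'Name': 'ACME Inc', 'Phone': '555-0100'}
c = {'Name': 'Globex', 'Phone': '555-0199'}
```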
### Step 3: Advanced Analytics
**Outlier Detection**
```python
def detect_outliers(self, object_name, numeric_fields):
    """Detect statistical outliers in numeric fields using the IQR method"""
    import numpy as np

    outlier_results = {}
    for field in numeric_fields:
        # Extract non-null numeric data
        query = f"SELECT {field} FROM {object_name} WHERE {field} != null"
        records = self.sf.query_all(query)['records']
        values = [r[field] for r in records if r[field] is not None]
        if len(values) < 10:  # Need sufficient data for analysis
            continue

        # Calculate quartiles and interquartile range
        values_array = np.array(values)
        q1 = np.percentile(values_array, 25)
        q3 = np.percentile(values_array, 75)
        iqr = q3 - q1

        # Define outlier bounds (1.5 * IQR rule)
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Identify outliers
        outliers = [v for v in values if v < lower_bound or v > upper_bound]
        outlier_results[field] = {
            'total_values': len(values),
            'outlier_count': len(outliers),
            'outlier_percentage': len(outliers) / len(values) * 100,
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'statistics': {
                'mean': np.mean(values_array),
                'median': np.median(values_array),
                'std_dev': np.std(values_array),
                'min': np.min(values_array),
                'max': np.max(values_array),
            },
        }
    return outlier_results
```
**Relationship Analysis**
```python
def analyze_relationships(self, parent_object, child_object, relationship_field):
    """Analyze parent-child relationship quality"""
    # Check for orphaned records (children whose parent no longer exists)
    orphan_query = f"""
        SELECT COUNT() FROM {child_object}
        WHERE {relationship_field} NOT IN (SELECT Id FROM {parent_object})
        AND {relationship_field} != null
    """
    orphan_count = self.sf.query(orphan_query)['totalSize']

    # Count children that carry a parent reference at all
    total_child_query = (
        f"SELECT COUNT() FROM {child_object} WHERE {relationship_field} != null"
    )
    total_child_count = self.sf.query(total_child_query)['totalSize']

    # Relationship distribution. Note: SOQL aliases take no AS keyword, and
    # grouped queries must aggregate a field, e.g. COUNT(Id)
    distribution_query = f"""
        SELECT {relationship_field}, COUNT(Id) child_count
        FROM {child_object}
        WHERE {relationship_field} != null
        GROUP BY {relationship_field}
        ORDER BY COUNT(Id) DESC
        LIMIT 100
    """
    distribution = self.sf.query_all(distribution_query)['records']

    return {
        'orphaned_records': orphan_count,
        'total_child_records': total_child_count,
        'integrity_percentage': ((total_child_count - orphan_count) / total_child_count * 100)
                                if total_child_count > 0 else 100,
        'relationship_distribution': distribution,
        'avg_children_per_parent': sum(r['child_count'] for r in distribution) / len(distribution)
                                   if distribution else 0,
    }
```
## Visualization and Reporting
### Data Quality Dashboard
```
Algorithm: Quality Dashboard Generation
INPUT: profiling_results
PROCESS:
1. CREATE dashboard components:
   - executive_summary from profiling_results
   - object_scorecard from profiling_results
   - trend_analysis from profiling_results
   - actionable_insights from profiling_results
2. COMPILE comprehensive_dashboard
3. RETURN dashboard
OUTPUT: data_quality_dashboard
```
```
Algorithm: Executive Summary Creation
INPUT: profiling_results
PROCESS:
1. COUNT total_objects analyzed
2. SUM total_records across all objects
3. EXTRACT quality_scores from completeness analysis:
   FOR each object in results:
     IF completeness data exists THEN
       EXTRACT field quality_scores
       ADD to quality_scores collection
4. CALCULATE overall_quality = average of all quality_scores
5. DETERMINE quality_grade based on overall_quality
6. IDENTIFY critical_issues from results
7. IDENTIFY improvement_opportunities from results
8. COMPILE executive_summary with all metrics
9. RETURN summary
OUTPUT: executive_quality_summary
```
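The roll-up in steps 3-5 can be sketched as follows; both the letter-grade thresholds and the `quality_scores` result shape are assumptions of this sketch, not defined by the framework:

```python
def quality_grade(overall_quality):
    """Map an overall quality score (0-100) to a letter grade.
    Thresholds are illustrative."""
    if overall_quality >= 95: return 'A'
    if overall_quality >= 85: return 'B'
    if overall_quality >= 70: return 'C'
    if overall_quality >= 50: return 'D'
    return 'F'

def executive_summary(profiling_results):
    """Average per-field quality scores across all analyzed objects."""
    scores = [
        s for obj in profiling_results.values()
        for s in obj.get('quality_scores', {}).values()
    ]
    overall = sum(scores) / len(scores) if scores else 0.0
    return {
        'objects_analyzed': len(profiling_results),
        'overall_quality': round(overall, 2),
        'quality_grade': quality_grade(overall),
    }

# Hypothetical profiling results
results = {
    'Account': {'quality_scores': {'Name': 99.0, 'Phone': 72.0}},
    'Contact': {'quality_scores': {'Email': 88.0}},
}
```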
### Automated Reporting
```yaml
Report_Types:
  Executive_Summary:
    Frequency: Weekly
    Audience: Leadership
    Content:
      - Overall quality metrics
      - Trend analysis
      - Critical issues
      - ROI impact
  Technical_Report:
    Frequency: Daily
    Audience: Data teams
    Content:
      - Detailed field analysis
      - Data anomalies
      - Processing statistics
      - Action items
  Business_Impact:
    Frequency: Monthly
    Audience: Business users
    Content:
      - Process impact
      - Data-driven insights
      - Improvement recommendations
      - Success metrics
```
## Continuous Monitoring
### Automated Profiling Pipeline
```
Algorithm: Automated Profiling Setup
INPUT: configuration, objects_to_monitor
PROCESS:
1. EXTRACT schedule from configuration (default: daily)
2. FOR each object_name in objects_to_monitor:
   CREATE monitoring job for object_name
3. RETURN setup_confirmation
OUTPUT: automated_profiling_jobs
```
```
Algorithm: Monitoring Job Creation
INPUT: object_name, configuration
PROCESS:
1. CREATE job_configuration:
   - object_name = provided object_name
   - profiling_rules = configuration.rules
   - alert_thresholds = configuration.thresholds
   - notification_settings = configuration.notifications
2. SCHEDULE job with cron or task scheduler
3. RETURN job_configuration
OUTPUT: scheduled_profiling_job
```
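Assembling the job configuration can be sketched as below; the configuration keys (`rules`, `thresholds`, `notifications`, `schedule`) are the ones named in the algorithms above, and the actual scheduling step would be handed to cron, Celery beat, or a similar scheduler rather than handled here:

```python
def create_monitoring_job(object_name, configuration):
    """Assemble a profiling job configuration dict for one object.
    Scheduling itself is delegated to an external scheduler (e.g. cron)."""
    return {
        'object_name': object_name,
        'profiling_rules': configuration.get('rules', {}),
        'alert_thresholds': configuration.get('thresholds', {}),
        'notification_settings': configuration.get('notifications', {}),
        'schedule': configuration.get('schedule', 'daily'),  # default: daily
    }

# Hypothetical monitoring configuration
config = {'rules': {'completeness': True}, 'thresholds': {'completeness_pct': 95}}
job = create_monitoring_job('Account', config)
```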
## Success Criteria
- ✅ Data quality baseline established
- ✅ Profiling pipeline implemented
- ✅ Quality metrics dashboard created
- ✅ Anomaly detection active
- ✅ Automated reporting configured
- ✅ Stakeholder insights delivered
- ✅ Improvement roadmap defined
- ✅ Continuous monitoring operational