# Duplicate Analysis Task
This task guides the comprehensive detection, analysis, and resolution of
duplicate records in Salesforce to maintain data quality and integrity.
## Purpose
Enable data quality analysts to:
- Detect duplicate records using advanced algorithms
- Analyze duplicate patterns and root causes
- Implement prevention strategies
- Develop automated duplicate management processes
- Maintain ongoing duplicate monitoring and resolution
## Prerequisites
- Access to Salesforce data and duplicate management tools
- Understanding of fuzzy matching algorithms
- Knowledge of business processes and data entry patterns
- Duplicate rule configuration permissions
- Familiarity with data cleansing tools and techniques
## Duplicate Analysis Framework
### 1. Detection Methodology
**Multi-Level Duplicate Detection**
```yaml
Detection_Levels:
  Exact_Match:
    Criteria: Field values match exactly
    Use_Cases: Simple data entry errors
    Confidence: 100%
    Processing: Fast, automated
  Fuzzy_Match:
    Criteria: Field values similar within threshold
    Use_Cases: Variations in spelling, formatting
    Confidence: 80-95%
    Processing: Algorithm-based matching
  Probabilistic_Match:
    Criteria: Multiple fields contribute to match score
    Use_Cases: Complex duplicate scenarios
    Confidence: 60-90%
    Processing: Machine learning models
  Business_Rule_Match:
    Criteria: Custom business logic
    Use_Cases: Industry-specific duplicates
    Confidence: Variable
    Processing: Rule engine evaluation
```
**Detection Algorithms**
```
Algorithm: Duplicate Detection Engine Initialization
INPUT: configuration
PROCESS:
  1. SET similarity_threshold from config (default: 0.85)
  2. DEFINE available algorithms:
     - exact: exact field matching
     - fuzzy: fuzzy string matching
     - levenshtein: edit distance calculation
     - phonetic: sound-based matching
     - token: token-based comparison
  3. INITIALIZE detection engine with algorithms
OUTPUT: configured_duplicate_detection_engine
```
```
Algorithm: Duplicate Detection using Specified Algorithm
INPUT: records, match_fields, algorithm_type (default: 'fuzzy')
PROCESS:
  1. VALIDATE algorithm_type exists in available algorithms
  2. IF algorithm_type not found THEN
       RAISE ValueError with algorithm name
  3. SELECT detection_function for algorithm_type
  4. EXECUTE detection_function with records and match_fields
  5. FORMAT duplicate_results with algorithm information
  6. RETURN formatted duplicate results
OUTPUT: duplicate_detection_results
```
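The two algorithms above amount to a small dispatch class: a registry of detection functions keyed by algorithm name, plus a validation step. The following is a minimal Python sketch under that assumption; only exact matching is implemented here, and the class and method names are illustrative rather than part of the framework.

```python
class DuplicateDetectionEngine:
    """Minimal sketch of the detection engine described above."""

    def __init__(self, config=None):
        config = config or {}
        self.similarity_threshold = config.get("similarity_threshold", 0.85)
        # Registry of detection functions keyed by algorithm name.
        self.algorithms = {
            "exact": self._detect_exact,
            "fuzzy": self._detect_fuzzy,
        }

    def detect_duplicates(self, records, match_fields, algorithm_type="fuzzy"):
        if algorithm_type not in self.algorithms:
            raise ValueError(f"Unknown algorithm: {algorithm_type}")
        groups = self.algorithms[algorithm_type](records, match_fields)
        return {"algorithm": algorithm_type, "duplicate_groups": groups}

    def _detect_exact(self, records, match_fields):
        # Group records whose normalized match_field values are identical.
        buckets = {}
        for rec in records:
            key = tuple((rec.get(f) or "").strip().lower() for f in match_fields)
            buckets.setdefault(key, []).append(rec)
        return [recs for recs in buckets.values() if len(recs) > 1]

    def _detect_fuzzy(self, records, match_fields):
        raise NotImplementedError("See the fuzzy matching sketch below")
```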
```
Algorithm: Fuzzy String Matching for Duplicate Detection
INPUT: records, match_fields
PROCESS:
  1. INITIALIZE duplicate_groups = empty list, processed_ids = empty set
  2. FOR each record1 in records:
     a. IF record1.Id already in processed_ids THEN continue
     b. INITIALIZE potential_duplicates = [record1]
     c. FOR each record2 in remaining records:
        - IF record2.Id in processed_ids THEN continue
        - CALCULATE similarity_score between record1 and record2
        - IF similarity_score >= similarity_threshold THEN
            ADD record2 to potential_duplicates
            ADD record2.Id to processed_ids
     d. IF potential_duplicates count > 1 THEN
          CREATE duplicate_group with:
            - master_candidate selection
            - all duplicate records
            - confidence_score calculation
            - match_reasons analysis
          ADD duplicate_group to duplicate_groups
     e. ADD record1.Id to processed_ids
  3. RETURN duplicate_groups
OUTPUT: fuzzy_match_duplicate_groups
```
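A minimal Python sketch of this grouping loop. For brevity it compares a single `Name` field with `difflib`; in practice the weighted multi-field similarity described in the next algorithm would be used instead, and the field name and master-candidate choice are illustrative assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # mirrors the engine's default threshold

def name_similarity(rec1, rec2, field="Name"):
    """Single-field stand-in for the weighted similarity calculation below."""
    v1 = (rec1.get(field) or "").strip().lower()
    v2 = (rec2.get(field) or "").strip().lower()
    if not v1 or not v2:
        return 0.0
    return SequenceMatcher(None, v1, v2).ratio()

def find_fuzzy_duplicate_groups(records):
    duplicate_groups, processed_ids = [], set()
    for i, rec1 in enumerate(records):
        if rec1["Id"] in processed_ids:
            continue
        group, scores = [rec1], []
        for rec2 in records[i + 1:]:
            if rec2["Id"] in processed_ids:
                continue
            score = name_similarity(rec1, rec2)
            if score >= SIMILARITY_THRESHOLD:
                group.append(rec2)
                scores.append(score)
                processed_ids.add(rec2["Id"])
        if len(group) > 1:
            duplicate_groups.append({
                "master_candidate": group[0],  # e.g. oldest or most complete record
                "duplicates": group,
                "confidence_score": sum(scores) / len(scores),
            })
        processed_ids.add(rec1["Id"])
    return duplicate_groups
```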
```
Algorithm: Record Similarity Calculation
INPUT: record1, record2, fields_to_compare
PROCESS:
  1. DEFINE field_weights: Name=0.4, Email=0.3, Phone=0.2, Website=0.1
  2. INITIALIZE total_weight=0, weighted_similarity=0
  3. FOR each field in fields_to_compare:
     a. IF field has defined weight THEN
        - GET value1 from record1[field], normalize (strip, lowercase)
        - GET value2 from record2[field], normalize (strip, lowercase)
        - IF both values exist THEN
            CALCULATE similarity using fuzzy string matching
            GET field_weight for field
            ADD (similarity * field_weight) to weighted_similarity
            ADD field_weight to total_weight
  4. IF total_weight > 0 THEN
       RETURN weighted_similarity / total_weight
  5. ELSE
       RETURN 0
OUTPUT: similarity_score (0.0 to 1.0)
```
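As a concrete illustration, a minimal Python version of the weighted similarity calculation. The field weights come from the algorithm above; `difflib` stands in for whichever fuzzy matcher the project actually uses.

```python
from difflib import SequenceMatcher

FIELD_WEIGHTS = {"Name": 0.4, "Email": 0.3, "Phone": 0.2, "Website": 0.1}

def record_similarity(record1, record2, fields_to_compare):
    """Weighted, normalized similarity between two records (0.0 to 1.0)."""
    total_weight = 0.0
    weighted_similarity = 0.0
    for field in fields_to_compare:
        weight = FIELD_WEIGHTS.get(field)
        if weight is None:
            continue  # fields without a defined weight are ignored
        value1 = (record1.get(field) or "").strip().lower()
        value2 = (record2.get(field) or "").strip().lower()
        if value1 and value2:
            similarity = SequenceMatcher(None, value1, value2).ratio()
            weighted_similarity += similarity * weight
            total_weight += weight
    return weighted_similarity / total_weight if total_weight > 0 else 0.0
```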
### 2. Advanced Detection Techniques
**Machine Learning Duplicate Detection**
```
Algorithm: ML Duplicate Detector Initialization
INPUT: configuration
PROCESS:
  1. INITIALIZE text_vectorizer with n-gram range (1,2)
  2. CONFIGURE clustering_model with density-based clustering
     - eps = 0.3 (neighborhood distance)
     - min_samples = 2 (minimum cluster size)
  3. RETURN configured ML detector
OUTPUT: ml_duplicate_detection_system
```
```
Algorithm: Text Feature Preparation for ML Analysis
INPUT: records, text_fields
PROCESS:
  1. INITIALIZE feature_matrix = empty list, record_ids = empty list
  2. FOR each record in records:
     a. COMBINE text fields into single feature string:
        - FOR each field in text_fields:
            GET field value from record, normalize (strip, lowercase)
        - JOIN all field values with spaces
     b. ADD combined_text to feature_matrix
     c. ADD record.Id to record_ids
  3. VECTORIZE feature_matrix using TF-IDF transformation
  4. RETURN tfidf_matrix and record_ids
OUTPUT: vectorized_features_and_record_mapping
```
```
Algorithm: Clustering-based Duplicate Group Detection
INPUT: records, text_fields
PROCESS:
  1. PREPARE vectorized features from records and text_fields
  2. APPLY density-based clustering to tfidf_matrix
  3. GET cluster_labels for each record
  4. INITIALIZE clusters = empty dictionary
  5. FOR each record_index, cluster_label in cluster_labels:
     a. IF cluster_label != -1 (not noise) THEN
          IF cluster_label not in clusters THEN
            CREATE empty list for cluster_label
          ADD record information to clusters[cluster_label]
  6. CREATE duplicate_groups from clusters:
       FOR each cluster_id, cluster_records in clusters:
         IF cluster has more than 1 record THEN
           CREATE duplicate_group with:
             - cluster_id
             - all records in cluster
             - similarity_scores calculation
           ADD to duplicate_groups
  7. RETURN duplicate_groups
OUTPUT: ml_detected_duplicate_clusters
```
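The three ML algorithms above map naturally onto scikit-learn. Below is a minimal sketch assuming scikit-learn is available: `TfidfVectorizer` and `DBSCAN` use the n-gram range, `eps`, and `min_samples` values given above, while the function name and record field access are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_duplicates(records, text_fields):
    """Group likely duplicates by vectorizing text fields and density clustering."""
    # 1. Combine the text fields of each record into one normalized string.
    corpus, record_ids = [], []
    for rec in records:
        parts = [(rec.get(f) or "").strip().lower() for f in text_fields]
        corpus.append(" ".join(p for p in parts if p))
        record_ids.append(rec["Id"])

    # 2. Vectorize with unigrams and bigrams, as in the initialization step.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # 3. Density-based clustering; cosine distance suits TF-IDF vectors.
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(tfidf_matrix)

    # 4. Collect clusters, skipping noise points (label == -1).
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(record_ids[idx])

    return [
        {"cluster_id": int(cid), "record_ids": ids}
        for cid, ids in clusters.items()
        if len(ids) > 1
    ]
```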
### 3. Business Rule Integration
**Custom Duplicate Rules**
```
Algorithm: Business Rule Application for Duplicate Detection
INPUT: record1, record2, rule_name, business_rules
PROCESS:
  1. IF rule_name not in business_rules THEN
       RETURN no_match_result with confidence 0
  2. GET rule configuration for rule_name
  3. INITIALIZE rule_results = empty list
  4. FOR each condition in rule conditions:
       EVALUATE condition between record1 and record2
       ADD evaluation result to rule_results
  5. COMBINE rule_results based on logic operator:
     a. IF rule logic = "AND" THEN
          overall_match = all individual matches
          overall_confidence = minimum confidence
     b. ELSE IF rule logic = "OR" THEN
          overall_match = any individual match
          overall_confidence = maximum confidence
  6. RETURN rule_evaluation_result with:
     - overall match status
     - overall confidence
     - rule_name
     - individual condition results
OUTPUT: business_rule_match_result
```
```
Algorithm: Individual Rule Condition Evaluation
INPUT: record1, record2, condition_definition
PROCESS:
  1. GET field1 = record1[condition.field]
  2. GET field2 = record2[condition.field]
  3. EVALUATE based on condition type:
     a. IF condition.type = "exact_match" THEN
          match = (field1 == field2)
          confidence = 1.0 if match else 0.0
     b. ELSE IF condition.type = "fuzzy_match" THEN
          IF both field1 and field2 exist THEN
            CALCULATE similarity using fuzzy string matching
            match = (similarity >= condition.threshold)
            confidence = similarity
          ELSE
            match = false, confidence = 0.0
     c. ELSE IF condition.type = "range_match" THEN
          IF both field1 and field2 are numeric THEN
            TRY:
              CONVERT to numbers
              CALCULATE diff_percent = |num1 - num2| / max(num1, num2) * 100
              match = (diff_percent <= condition.tolerance_percent)
              confidence = max(0, 1 - diff_percent / 100)
            CATCH conversion errors:
              match = false, confidence = 0.0
          ELSE
            match = false, confidence = 0.0
  4. RETURN condition_evaluation_result with:
     - condition name, match status, confidence
     - field name and both values
OUTPUT: condition_evaluation_result
```
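A minimal Python sketch of both business-rule algorithms, assuming rules are supplied as plain dictionaries (the rule structure shown here is an illustrative assumption); only the exact and fuzzy condition types from the pseudocode are implemented.

```python
from difflib import SequenceMatcher

def evaluate_condition(record1, record2, condition):
    """Evaluate one rule condition between two records."""
    field = condition["field"]
    v1, v2 = record1.get(field), record2.get(field)
    if condition["type"] == "exact_match":
        match = v1 is not None and v1 == v2
        confidence = 1.0 if match else 0.0
    elif condition["type"] == "fuzzy_match":
        if v1 and v2:
            confidence = SequenceMatcher(None, str(v1).lower(), str(v2).lower()).ratio()
            match = confidence >= condition["threshold"]
        else:
            match, confidence = False, 0.0
    else:
        match, confidence = False, 0.0  # range_match omitted in this sketch
    return {"field": field, "match": match, "confidence": confidence}

def apply_business_rule(record1, record2, rule_name, business_rules):
    """Combine condition results with the rule's AND/OR logic."""
    rule = business_rules.get(rule_name)
    if rule is None:
        return {"match": False, "confidence": 0.0, "rule_name": rule_name}
    results = [evaluate_condition(record1, record2, c) for c in rule["conditions"]]
    if rule.get("logic", "AND") == "AND":
        match = all(r["match"] for r in results)
        confidence = min(r["confidence"] for r in results)
    else:  # OR
        match = any(r["match"] for r in results)
        confidence = max(r["confidence"] for r in results)
    return {"match": match, "confidence": confidence,
            "rule_name": rule_name, "conditions": results}
```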
## Implementation Steps
### Step 1: Duplicate Analysis Setup
**Analysis Configuration**
```yaml
duplicate_analysis_config:
  objects_to_analyze:
    Account:
      match_fields: [Name, Website, Phone, BillingPostalCode]
      algorithms: [fuzzy, exact, business_rule]
      confidence_threshold: 0.8
    Contact:
      match_fields: [FirstName, LastName, Email, Phone]
      algorithms: [fuzzy, phonetic]
      confidence_threshold: 0.85
    Lead:
      match_fields: [FirstName, LastName, Email, Company, Phone]
      algorithms: [fuzzy, ml_clustering]
      confidence_threshold: 0.75
  processing_options:
    batch_size: 1000
    parallel_processing: true
    max_workers: 4
    progress_reporting: true
```
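If this configuration lives in a YAML file, it can be loaded and iterated as in the following minimal sketch; the file name and the PyYAML dependency are assumptions, not part of the framework.

```python
import yaml

# Hypothetical path; adjust to wherever the analysis configuration is stored.
with open("duplicate_analysis_config.yaml") as fh:
    config = yaml.safe_load(fh)["duplicate_analysis_config"]

for object_name, obj_cfg in config["objects_to_analyze"].items():
    print(object_name, obj_cfg["match_fields"], obj_cfg["confidence_threshold"])
```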
**Duplicate Analysis Execution**
```
Algorithm: Comprehensive Duplicate Analysis Orchestration
INPUT: object_names (optional), configuration, salesforce_connection
PROCESS:
  1. IF object_names not provided THEN
       GET object_names from configuration objects_to_analyze
  2. INITIALIZE analysis_results = empty dictionary
  3. FOR each object_name in object_names:
     a. LOG analysis start for object_name
     b. GET object_configuration for object_name
     c. EXTRACT records for analysis using object_configuration
     d. INITIALIZE object_results = empty dictionary
     e. FOR each algorithm in object_configuration algorithms:
        - LOG algorithm execution
        - RUN duplicate detection with records, match_fields, algorithm
        - STORE results in object_results[algorithm]
     f. CONSOLIDATE algorithm results and rank findings
     g. COMPILE analysis_results[object_name] with:
        - total_records_analyzed count
        - duplicate_groups_found count
        - algorithm_results details
        - consolidated_results
        - analysis_metadata with timestamp and config
  4. RETURN complete analysis_results
OUTPUT: comprehensive_duplicate_analysis_report
```
```
Algorithm: Record Extraction for Duplicate Analysis
INPUT: object_name, object_configuration
PROCESS:
  1. COMBINE match_fields with standard fields [Id, CreatedDate, LastModifiedDate]
  2. REMOVE duplicates and CREATE field_list
  3. BUILD query:
       SELECT field_list FROM object_name
       WHERE IsDeleted = FALSE
       ORDER BY CreatedDate DESC
  4. EXECUTE query against Salesforce connection
  5. RETURN records from query result
OUTPUT: extracted_records_for_analysis
```
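A minimal sketch of the extraction step, assuming the simple_salesforce Python client (any authenticated SOQL-capable client would do); the field-list deduplication and query shape follow the algorithm above.

```python
from simple_salesforce import Salesforce  # assumed client; swap for your own connection

def extract_records(sf: Salesforce, object_name: str, object_config: dict) -> list:
    """Pull the fields needed for duplicate analysis for one object."""
    # De-duplicate the field list while preserving order.
    field_list = list(dict.fromkeys(
        object_config["match_fields"] + ["Id", "CreatedDate", "LastModifiedDate"]
    ))
    soql = (
        f"SELECT {', '.join(field_list)} FROM {object_name} "
        f"WHERE IsDeleted = FALSE ORDER BY CreatedDate DESC"
    )
    return sf.query_all(soql)["records"]
```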
### Step 2: Pattern Analysis and Root Cause Identification
**Duplicate Pattern Analysis**
```
Algorithm: Comprehensive Duplicate Pattern Analysis
INPUT: duplicate_groups, object_name
PROCESS:
  1. CREATE pattern_analysis with components:
     - temporal_patterns from duplicate_groups
     - user_patterns from creation patterns
     - data_source_patterns analysis
     - field_variation_patterns analysis
     - similarity_distribution analysis
  2. RETURN complete pattern_analysis
OUTPUT: duplicate_pattern_analysis_report
```
```
Algorithm: Temporal Pattern Analysis for Duplicates
INPUT: duplicate_groups
PROCESS:
  1. INITIALIZE temporal_data = empty list
  2. FOR each group in duplicate_groups:
       FOR each record in group duplicates:
         IF record has CreatedDate THEN
           EXTRACT temporal information:
             - created_date, hour (from timestamp)
             - day_of_week (from date conversion)
             - group_id
           ADD to temporal_data
  3. INITIALIZE hourly_distribution = empty dict, daily_distribution = empty dict
  4. FOR each data_point in temporal_data:
       INCREMENT hourly_distribution[hour]
       INCREMENT daily_distribution[day_of_week]
  5. DETERMINE peak_creation_hour and peak_creation_day from distributions
  6. RETURN temporal_analysis with:
     - hourly and daily distributions
     - peak creation times
     - total duplicates analyzed count
OUTPUT: temporal_duplicate_patterns
```
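A minimal Python sketch of the temporal analysis, assuming `CreatedDate` values are ISO 8601 strings as returned by the Salesforce REST API and that each group exposes its records under a `duplicates` key (an assumption carried over from the grouping sketch above).

```python
from collections import Counter
from datetime import datetime

def analyze_temporal_patterns(duplicate_groups):
    """Summarize when duplicate records were created."""
    hourly, daily = Counter(), Counter()
    total = 0
    for group in duplicate_groups:
        for record in group["duplicates"]:
            created = record.get("CreatedDate")
            if not created:
                continue
            # Salesforce timestamps look like '2024-05-01T09:15:30.000+0000'.
            dt = datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S")
            hourly[dt.hour] += 1
            daily[dt.strftime("%A")] += 1
            total += 1
    return {
        "hourly_distribution": dict(hourly),
        "daily_distribution": dict(daily),
        "peak_creation_hour": hourly.most_common(1)[0][0] if hourly else None,
        "peak_creation_day": daily.most_common(1)[0][0] if daily else None,
        "total_duplicates_analyzed": total,
    }
```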
```
Algorithm: Root Cause Analysis Generation
INPUT: pattern_analysis
PROCESS:
  1. INITIALIZE root_causes = empty list
  2. ANALYZE temporal patterns:
       IF peak_creation_hour in [9, 10, 11] (morning hours) THEN
         ADD root_cause:
           - category: "Process Issue"
           - cause: "Morning data entry rush"
           - evidence: peak hour information
           - recommendation: real-time prevention during peak hours
  3. ANALYZE user patterns:
       IF high_duplicate_users count > 0 THEN
         ADD root_cause:
           - category: "Training Issue"
           - cause: "Specific users creating many duplicates"
           - evidence: user count information
           - recommendation: targeted training
  4. RETURN complete root_causes analysis
OUTPUT: root_cause_analysis_report
```
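A minimal sketch of how these heuristics might be encoded; the two rules mirror the pseudocode above, while the dictionary keys (`temporal_patterns`, `high_duplicate_users`, and so on) are illustrative assumptions about the pattern-analysis output.

```python
def generate_root_causes(pattern_analysis):
    """Translate observed duplicate patterns into candidate root causes."""
    root_causes = []
    peak_hour = pattern_analysis.get("temporal_patterns", {}).get("peak_creation_hour")
    if peak_hour in (9, 10, 11):
        root_causes.append({
            "category": "Process Issue",
            "cause": "Morning data entry rush",
            "evidence": f"Peak duplicate creation at hour {peak_hour}",
            "recommendation": "Enable real-time duplicate prevention during peak hours",
        })
    high_dup_users = pattern_analysis.get("user_patterns", {}).get("high_duplicate_users", [])
    if high_dup_users:
        root_causes.append({
            "category": "Training Issue",
            "cause": "Specific users creating many duplicates",
            "evidence": f"{len(high_dup_users)} users exceed the duplicate threshold",
            "recommendation": "Provide targeted training for the affected users",
        })
    return root_causes
```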
### Step 3: Prevention and Management
**Automated Prevention Rules**
```
Algorithm: Salesforce Duplicate Rules Creation
INPUT: analysis_results
PROCESS:
  1. INITIALIZE duplicate_rules = empty list
  2. FOR each object_name, results in analysis_results:
     a. IDENTIFY best_algorithm from algorithm_results
     b. CREATE matching_rule configuration:
        - sobjectType: "MatchingRule"
        - DeveloperName: object_name + "_Duplicate_Rule"
        - MasterLabel: object_name + " Duplicate Detection Rule"
        - SobjectType: object_name
        - MatchingRuleItems from best_algorithm and results
     c. CREATE duplicate_rule configuration:
        - sobjectType: "DuplicateRule"
        - DeveloperName: object_name + "_Duplicate_Prevention"
        - MasterLabel: object_name + " Duplicate Prevention"
        - SobjectType: object_name
        - ActionOnInsert: "Block"
        - ActionOnUpdate: "Allow"
        - AlertText: "Potential duplicate record detected"
        - MatchingRule: matching_rule DeveloperName
     d. ADD rule_pair to duplicate_rules
  3. RETURN complete duplicate_rules
OUTPUT: salesforce_duplicate_rules_configuration
```
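A minimal Python sketch that assembles the rule pair described above as plain dictionaries. The field names mirror the pseudocode; the best-algorithm selection is illustrative, and how the payloads are actually deployed (Metadata API, change set, etc.) is out of scope and should be verified against the org's tooling.

```python
def build_duplicate_rules(analysis_results):
    """Build matching-rule / duplicate-rule configuration pairs per object."""
    duplicate_rules = []
    for object_name, results in analysis_results.items():
        # Illustrative selection: prefer the algorithm that found the most groups.
        best_algorithm = max(
            results["algorithm_results"],
            key=lambda a: len(results["algorithm_results"][a].get("duplicate_groups", [])),
        )
        matching_rule = {
            "sobjectType": "MatchingRule",
            "DeveloperName": f"{object_name}_Duplicate_Rule",
            "MasterLabel": f"{object_name} Duplicate Detection Rule",
            "SobjectType": object_name,
            "MatchingRuleItems": results["algorithm_results"][best_algorithm].get("match_fields", []),
        }
        duplicate_rule = {
            "sobjectType": "DuplicateRule",
            "DeveloperName": f"{object_name}_Duplicate_Prevention",
            "MasterLabel": f"{object_name} Duplicate Prevention",
            "SobjectType": object_name,
            "ActionOnInsert": "Block",
            "ActionOnUpdate": "Allow",
            "AlertText": "Potential duplicate record detected",
            "MatchingRule": matching_rule["DeveloperName"],
        }
        duplicate_rules.append({"matching_rule": matching_rule, "duplicate_rule": duplicate_rule})
    return duplicate_rules
```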
```
Algorithm: Custom Prevention Logic Implementation
INPUT: prevention_configuration
PROCESS:
  1. GENERATE apex_trigger code:
     - CREATE trigger on prevention_config object
     - SET trigger events: before insert, before update
     - CALL DuplicatePreventionHandler.handleDuplicatePrevention
  2. GENERATE apex_handler code:
     - CREATE public class DuplicatePreventionHandler
     - IMPLEMENT handleDuplicatePrevention method:
         FOR each record in newRecords:
           FIND potential_duplicates for record
           IF duplicates found THEN
             ADD error to record with duplicate message
     - IMPLEMENT findPotentialDuplicates method (custom matching logic)
     - IMPLEMENT buildDuplicateMessage method (user-friendly message)
  3. RETURN code_package with:
     - trigger_code
     - handler_code
OUTPUT: custom_duplicate_prevention_apex_code
```
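A minimal sketch of the code-generation step in Python. The generated trigger skeleton follows the structure described above; the handler body (matching logic and message building) is left as placeholder comments to be filled in per object, and the `object_name` configuration key is an illustrative assumption.

```python
def generate_prevention_code(prevention_config):
    """Render an Apex trigger skeleton and handler stub for the configured object."""
    obj = prevention_config["object_name"]  # e.g. "Account"; key name is illustrative
    trigger_code = f"""trigger {obj}DuplicatePrevention on {obj} (before insert, before update) {{
    DuplicatePreventionHandler.handleDuplicatePrevention(Trigger.new);
}}"""
    handler_code = """public class DuplicatePreventionHandler {
    public static void handleDuplicatePrevention(List<SObject> newRecords) {
        for (SObject record : newRecords) {
            // TODO: findPotentialDuplicates(record) - custom matching logic
            // TODO: record.addError(buildDuplicateMessage(...)) when duplicates are found
        }
    }
}"""
    return {"trigger_code": trigger_code, "handler_code": handler_code}
```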
## Success Criteria
- ✅ Comprehensive duplicate detection implemented
- ✅ Pattern analysis completed and documented
- ✅ Root cause analysis generated
- ✅ Prevention rules configured and active
- ✅ Automated monitoring established
- ✅ Data cleansing procedures operational
- ✅ User training materials created
- ✅ Ongoing duplicate management process established