# Duplicate Analysis Task
This task guides the comprehensive detection, analysis, and resolution of
duplicate records in Salesforce to maintain data quality and integrity.
## Purpose
Enable data quality analysts to:
- Detect duplicate records using advanced algorithms
- Analyze duplicate patterns and root causes
- Implement prevention strategies
- Develop automated duplicate management processes
- Maintain ongoing duplicate monitoring and resolution
## Prerequisites
- Access to Salesforce data and duplicate management tools
- Understanding of fuzzy matching algorithms
- Knowledge of business processes and data entry patterns
- Duplicate rule configuration permissions
- Familiarity with data cleansing tools and techniques
## Duplicate Analysis Framework
### 1. Detection Methodology
**Multi-Level Duplicate Detection**
```yaml
Detection_Levels:
  Exact_Match:
    Criteria: Field values match exactly
    Use_Cases: Simple data entry errors
    Confidence: 100%
    Processing: Fast, automated
  Fuzzy_Match:
    Criteria: Field values similar within threshold
    Use_Cases: Variations in spelling, formatting
    Confidence: 80-95%
    Processing: Algorithm-based matching
  Probabilistic_Match:
    Criteria: Multiple fields contribute to match score
    Use_Cases: Complex duplicate scenarios
    Confidence: 60-90%
    Processing: Machine learning models
  Business_Rule_Match:
    Criteria: Custom business logic
    Use_Cases: Industry-specific duplicates
    Confidence: Variable
    Processing: Rule engine evaluation
```
**Detection Algorithms**
```
Algorithm: Duplicate Detection Engine Initialization
INPUT: configuration
PROCESS:
  1. SET similarity_threshold from config (default: 0.85)
  2. DEFINE available algorithms:
     - exact: exact field matching
     - fuzzy: fuzzy string matching
     - levenshtein: edit distance calculation
     - phonetic: sound-based matching
     - token: token-based comparison
  3. INITIALIZE detection engine with algorithms
OUTPUT: configured_duplicate_detection_engine
```
```
Algorithm: Duplicate Detection using Specified Algorithm
INPUT: records, match_fields, algorithm_type (default: 'fuzzy')
PROCESS:
  1. VALIDATE algorithm_type exists in available algorithms
  2. IF algorithm_type not found THEN
       RAISE ValueError with algorithm name
  3. SELECT detection_function for algorithm_type
  4. EXECUTE detection_function with records and match_fields
  5. FORMAT duplicate_results with algorithm information
  6. RETURN formatted duplicate results
OUTPUT: duplicate_detection_results
```
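The two algorithms above amount to a small dispatch class: a registry of detection functions keyed by algorithm name, plus a validation step. The following is a minimal Python sketch under that assumption; only exact matching is implemented here, and the class and method names are illustrative rather than part of the framework.

```python
class DuplicateDetectionEngine:
    """Minimal sketch of the detection engine described above."""

    def __init__(self, config=None):
        config = config or {}
        self.similarity_threshold = config.get("similarity_threshold", 0.85)
        # Registry of detection functions keyed by algorithm name.
        self.algorithms = {
            "exact": self._detect_exact,
            "fuzzy": self._detect_fuzzy,
        }

    def detect_duplicates(self, records, match_fields, algorithm_type="fuzzy"):
        if algorithm_type not in self.algorithms:
            raise ValueError(f"Unknown algorithm: {algorithm_type}")
        groups = self.algorithms[algorithm_type](records, match_fields)
        return {"algorithm": algorithm_type, "duplicate_groups": groups}

    def _detect_exact(self, records, match_fields):
        # Group records whose normalized match_field values are identical.
        buckets = {}
        for rec in records:
            key = tuple((rec.get(f) or "").strip().lower() for f in match_fields)
            buckets.setdefault(key, []).append(rec)
        return [recs for recs in buckets.values() if len(recs) > 1]

    def _detect_fuzzy(self, records, match_fields):
        raise NotImplementedError("See the fuzzy matching sketch below")
```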
```
Algorithm: Fuzzy String Matching for Duplicate Detection
INPUT: records, match_fields
PROCESS:
  1. INITIALIZE duplicate_groups = empty list, processed_ids = empty set
  2. FOR each record1 in records:
     a. IF record1.Id already in processed_ids THEN continue
     b. INITIALIZE potential_duplicates = [record1]
     c. FOR each record2 in remaining records:
        - IF record2.Id in processed_ids THEN continue
        - CALCULATE similarity_score between record1 and record2
        - IF similarity_score >= similarity_threshold THEN
            ADD record2 to potential_duplicates
            ADD record2.Id to processed_ids
     d. IF potential_duplicates count > 1 THEN
          CREATE duplicate_group with:
            - master_candidate selection
            - all duplicate records
            - confidence_score calculation
            - match_reasons analysis
          ADD duplicate_group to duplicate_groups
     e. ADD record1.Id to processed_ids
  3. RETURN duplicate_groups
OUTPUT: fuzzy_match_duplicate_groups
```
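A minimal Python sketch of this grouping loop. For brevity it compares a single `Name` field with `difflib`; in practice the weighted multi-field similarity described in the next algorithm would be used instead, and the field name and master-candidate choice are illustrative assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # mirrors the engine's default threshold

def name_similarity(rec1, rec2, field="Name"):
    """Single-field stand-in for the weighted similarity calculation below."""
    v1 = (rec1.get(field) or "").strip().lower()
    v2 = (rec2.get(field) or "").strip().lower()
    if not v1 or not v2:
        return 0.0
    return SequenceMatcher(None, v1, v2).ratio()

def find_fuzzy_duplicate_groups(records):
    duplicate_groups, processed_ids = [], set()
    for i, rec1 in enumerate(records):
        if rec1["Id"] in processed_ids:
            continue
        group, scores = [rec1], []
        for rec2 in records[i + 1:]:
            if rec2["Id"] in processed_ids:
                continue
            score = name_similarity(rec1, rec2)
            if score >= SIMILARITY_THRESHOLD:
                group.append(rec2)
                scores.append(score)
                processed_ids.add(rec2["Id"])
        if len(group) > 1:
            duplicate_groups.append({
                "master_candidate": group[0],  # e.g. oldest or most complete record
                "duplicates": group,
                "confidence_score": sum(scores) / len(scores),
            })
        processed_ids.add(rec1["Id"])
    return duplicate_groups
```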
```
Algorithm: Record Similarity Calculation
INPUT: record1, record2, fields_to_compare
PROCESS:
  1. DEFINE field_weights: Name=0.4, Email=0.3, Phone=0.2, Website=0.1
  2. INITIALIZE total_weight=0, weighted_similarity=0
  3. FOR each field in fields_to_compare:
     a. IF field has defined weight THEN
        - GET value1 from record1[field], normalize (strip, lowercase)
        - GET value2 from record2[field], normalize (strip, lowercase)
        - IF both values exist THEN
            CALCULATE similarity using fuzzy string matching
            GET field_weight for field
            ADD (similarity * field_weight) to weighted_similarity
            ADD field_weight to total_weight
  4. IF total_weight > 0 THEN
       RETURN weighted_similarity / total_weight
  5. ELSE
       RETURN 0
OUTPUT: similarity_score (0.0 to 1.0)
```
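As a concrete illustration, a minimal Python version of the weighted similarity calculation. The field weights come from the algorithm above; `difflib` stands in for whichever fuzzy matcher the project actually uses.

```python
from difflib import SequenceMatcher

FIELD_WEIGHTS = {"Name": 0.4, "Email": 0.3, "Phone": 0.2, "Website": 0.1}

def record_similarity(record1, record2, fields_to_compare):
    """Weighted, normalized similarity between two records (0.0 to 1.0)."""
    total_weight = 0.0
    weighted_similarity = 0.0
    for field in fields_to_compare:
        weight = FIELD_WEIGHTS.get(field)
        if weight is None:
            continue  # fields without a defined weight are ignored
        value1 = (record1.get(field) or "").strip().lower()
        value2 = (record2.get(field) or "").strip().lower()
        if value1 and value2:
            similarity = SequenceMatcher(None, value1, value2).ratio()
            weighted_similarity += similarity * weight
            total_weight += weight
    return weighted_similarity / total_weight if total_weight > 0 else 0.0
```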
### 2. Advanced Detection Techniques
**Machine Learning Duplicate Detection**
```
Algorithm: ML Duplicate Detector Initialization
INPUT: configuration
PROCESS:
  1. INITIALIZE text_vectorizer with n-gram range (1,2)
  2. CONFIGURE clustering_model with density-based clustering
     - eps = 0.3 (neighborhood distance)
     - min_samples = 2 (minimum cluster size)
  3. RETURN configured ML detector
OUTPUT: ml_duplicate_detection_system
```
```
Algorithm: Text Feature Preparation for ML Analysis
INPUT: records, text_fields
PROCESS:
  1. INITIALIZE feature_matrix = empty list, record_ids = empty list
  2. FOR each record in records:
     a. COMBINE text fields into single feature string:
        - FOR each field in text_fields:
            GET field value from record, normalize (strip, lowercase)
        - JOIN all field values with spaces
     b. ADD combined_text to feature_matrix
     c. ADD record.Id to record_ids
  3. VECTORIZE feature_matrix using TF-IDF transformation
  4. RETURN tfidf_matrix and record_ids
OUTPUT: vectorized_features_and_record_mapping
```
```
Algorithm: Clustering-based Duplicate Group Detection
INPUT: records, text_fields
PROCESS:
  1. PREPARE vectorized features from records and text_fields
  2. APPLY density-based clustering to tfidf_matrix
  3. GET cluster_labels for each record
  4. INITIALIZE clusters = empty dictionary
  5. FOR each record_index, cluster_label in cluster_labels:
     a. IF cluster_label != -1 (not noise) THEN
          IF cluster_label not in clusters THEN
            CREATE empty list for cluster_label
          ADD record information to clusters[cluster_label]
  6. CREATE duplicate_groups from clusters:
       FOR each cluster_id, cluster_records in clusters:
         IF cluster has more than 1 record THEN
           CREATE duplicate_group with:
             - cluster_id
             - all records in cluster
             - similarity_scores calculation
           ADD to duplicate_groups
  7. RETURN duplicate_groups
OUTPUT: ml_detected_duplicate_clusters
```
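The three ML algorithms above map naturally onto scikit-learn. Below is a minimal sketch assuming scikit-learn is available: `TfidfVectorizer` and `DBSCAN` use the n-gram range, `eps`, and `min_samples` values given above, while the function name and record field access are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_duplicates(records, text_fields):
    """Group likely duplicates by vectorizing text fields and density clustering."""
    # 1. Combine the text fields of each record into one normalized string.
    corpus, record_ids = [], []
    for rec in records:
        parts = [(rec.get(f) or "").strip().lower() for f in text_fields]
        corpus.append(" ".join(p for p in parts if p))
        record_ids.append(rec["Id"])

    # 2. Vectorize with unigrams and bigrams, as in the initialization step.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # 3. Density-based clustering; cosine distance suits TF-IDF vectors.
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(tfidf_matrix)

    # 4. Collect clusters, skipping noise points (label == -1).
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(record_ids[idx])

    return [
        {"cluster_id": int(cid), "record_ids": ids}
        for cid, ids in clusters.items()
        if len(ids) > 1
    ]
```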
### 3. Business Rule Integration
**Custom Duplicate Rules**
```
Algorithm: Business Rule Application for Duplicate Detection
INPUT: record1, record2, rule_name, business_rules
PROCESS:
  1. IF rule_name not in business_rules THEN
       RETURN no_match_result with confidence 0
  2. GET rule configuration for rule_name
  3. INITIALIZE rule_results = empty list
  4. FOR each condition in rule conditions:
       EVALUATE condition between record1 and record2
       ADD evaluation result to rule_results
  5. COMBINE rule_results based on logic operator:
     a. IF rule logic = "AND" THEN
          overall_match = all individual matches
          overall_confidence = minimum confidence
     b. ELSE IF rule logic = "OR" THEN
          overall_match = any individual match
          overall_confidence = maximum confidence
  6. RETURN rule_evaluation_result with:
     - overall match status
     - overall confidence
     - rule_name
     - individual condition results
OUTPUT: business_rule_match_result
```
```
Algorithm: Individual Rule Condition Evaluation
INPUT: record1, record2, condition_definition
PROCESS:
  1. GET field1 = record1[condition.field]
  2. GET field2 = record2[condition.field]
  3. EVALUATE based on condition type:
     a. IF condition.type = "exact_match" THEN
          match = (field1 == field2)
          confidence = 1.0 if match else 0.0
     b. ELSE IF condition.type = "fuzzy_match" THEN
          IF both field1 and field2 exist THEN
            CALCULATE similarity using fuzzy string matching
            match = (similarity >= condition.threshold)
            confidence = similarity
          ELSE
            match = false, confidence = 0.0
     c. ELSE IF condition.type = "range_match" THEN
          IF both field1 and field2 are numeric THEN
            TRY:
              CONVERT to numbers
              CALCULATE diff_percent = |num1 - num2| / max(num1, num2) * 100
              match = (diff_percent <= condition.tolerance_percent)
              confidence = max(0, 1 - diff_percent / 100)
            CATCH conversion errors:
              match = false, confidence = 0.0
          ELSE
            match = false, confidence = 0.0
  4. RETURN condition_evaluation_result with:
     - condition name, match status, confidence
     - field name and both values
OUTPUT: condition_evaluation_result
```
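A minimal Python sketch of both business-rule algorithms, assuming rules are supplied as plain dictionaries (the rule structure shown here is an illustrative assumption); only the exact and fuzzy condition types from the pseudocode are implemented.

```python
from difflib import SequenceMatcher

def evaluate_condition(record1, record2, condition):
    """Evaluate one rule condition between two records."""
    field = condition["field"]
    v1, v2 = record1.get(field), record2.get(field)
    if condition["type"] == "exact_match":
        match = v1 is not None and v1 == v2
        confidence = 1.0 if match else 0.0
    elif condition["type"] == "fuzzy_match":
        if v1 and v2:
            confidence = SequenceMatcher(None, str(v1).lower(), str(v2).lower()).ratio()
            match = confidence >= condition["threshold"]
        else:
            match, confidence = False, 0.0
    else:
        match, confidence = False, 0.0  # range_match omitted in this sketch
    return {"field": field, "match": match, "confidence": confidence}

def apply_business_rule(record1, record2, rule_name, business_rules):
    """Combine condition results with the rule's AND/OR logic."""
    rule = business_rules.get(rule_name)
    if rule is None:
        return {"match": False, "confidence": 0.0, "rule_name": rule_name}
    results = [evaluate_condition(record1, record2, c) for c in rule["conditions"]]
    if rule.get("logic", "AND") == "AND":
        match = all(r["match"] for r in results)
        confidence = min(r["confidence"] for r in results)
    else:  # OR
        match = any(r["match"] for r in results)
        confidence = max(r["confidence"] for r in results)
    return {"match": match, "confidence": confidence,
            "rule_name": rule_name, "conditions": results}
```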
## Implementation Steps
### Step 1: Duplicate Analysis Setup
**Analysis Configuration**
```yaml
duplicate_analysis_config:
  objects_to_analyze:
    Account:
      match_fields: [Name, Website, Phone, BillingPostalCode]
      algorithms: [fuzzy, exact, business_rule]
      confidence_threshold: 0.8
    Contact:
      match_fields: [FirstName, LastName, Email, Phone]
      algorithms: [fuzzy, phonetic]
      confidence_threshold: 0.85
    Lead:
      match_fields: [FirstName, LastName, Email, Company, Phone]
      algorithms: [fuzzy, ml_clustering]
      confidence_threshold: 0.75
  processing_options:
    batch_size: 1000
    parallel_processing: true
    max_workers: 4
    progress_reporting: true
```
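If this configuration lives in a YAML file, it can be loaded and iterated as in the following minimal sketch; the file name and the PyYAML dependency are assumptions, not part of the framework.

```python
import yaml

# Hypothetical path; adjust to wherever the analysis configuration is stored.
with open("duplicate_analysis_config.yaml") as fh:
    config = yaml.safe_load(fh)["duplicate_analysis_config"]

for object_name, obj_cfg in config["objects_to_analyze"].items():
    print(object_name, obj_cfg["match_fields"], obj_cfg["confidence_threshold"])
```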
**Duplicate Analysis Execution**
```
Algorithm: Comprehensive Duplicate Analysis Orchestration
INPUT: object_names (optional), configuration, salesforce_connection
PROCESS:
  1. IF object_names not provided THEN
       GET object_names from configuration objects_to_analyze
  2. INITIALIZE analysis_results = empty dictionary
  3. FOR each object_name in object_names:
     a. LOG analysis start for object_name
     b. GET object_configuration for object_name
     c. EXTRACT records for analysis using object_configuration
     d. INITIALIZE object_results = empty dictionary
     e. FOR each algorithm in object_configuration algorithms:
        - LOG algorithm execution
        - RUN duplicate detection with records, match_fields, algorithm
        - STORE results in object_results[algorithm]
     f. CONSOLIDATE algorithm results and rank findings
     g. COMPILE analysis_results[object_name] with:
        - total_records_analyzed count
        - duplicate_groups_found count
        - algorithm_results details
        - consolidated_results
        - analysis_metadata with timestamp and config
  4. RETURN complete analysis_results
OUTPUT: comprehensive_duplicate_analysis_report
```
```
Algorithm: Record Extraction for Duplicate Analysis
INPUT: object_name, object_configuration
PROCESS:
  1. COMBINE match_fields with standard fields [Id, CreatedDate, LastModifiedDate]
  2. REMOVE duplicates and CREATE field_list
  3. BUILD query:
       SELECT field_list FROM object_name
       WHERE IsDeleted = FALSE
       ORDER BY CreatedDate DESC
  4. EXECUTE query against Salesforce connection
  5. RETURN records from query result
OUTPUT: extracted_records_for_analysis
```
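A minimal sketch of the extraction step, assuming the simple_salesforce Python client (any authenticated SOQL-capable client would do); the field-list deduplication and query shape follow the algorithm above.

```python
from simple_salesforce import Salesforce  # assumed client; swap for your own connection

def extract_records(sf: Salesforce, object_name: str, object_config: dict) -> list:
    """Pull the fields needed for duplicate analysis for one object."""
    # De-duplicate the field list while preserving order.
    field_list = list(dict.fromkeys(
        object_config["match_fields"] + ["Id", "CreatedDate", "LastModifiedDate"]
    ))
    soql = (
        f"SELECT {', '.join(field_list)} FROM {object_name} "
        f"WHERE IsDeleted = FALSE ORDER BY CreatedDate DESC"
    )
    return sf.query_all(soql)["records"]
```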
### Step 2: Pattern Analysis and Root Cause Identification
**Duplicate Pattern Analysis**
```
Algorithm: Comprehensive Duplicate Pattern Analysis
INPUT: duplicate_groups, object_name
PROCESS:
  1. CREATE pattern_analysis with components:
     - temporal_patterns from duplicate_groups
     - user_patterns from creation patterns
     - data_source_patterns analysis
     - field_variation_patterns analysis
     - similarity_distribution analysis
  2. RETURN complete pattern_analysis
OUTPUT: duplicate_pattern_analysis_report
```
```
Algorithm: Temporal Pattern Analysis for Duplicates
INPUT: duplicate_groups
PROCESS:
  1. INITIALIZE temporal_data = empty list
  2. FOR each group in duplicate_groups:
       FOR each record in group duplicates:
         IF record has CreatedDate THEN
           EXTRACT temporal information:
             - created_date, hour (from timestamp)
             - day_of_week (from date conversion)
             - group_id
           ADD to temporal_data
  3. INITIALIZE hourly_distribution = empty dict, daily_distribution = empty dict
  4. FOR each data_point in temporal_data:
       INCREMENT hourly_distribution[hour]
       INCREMENT daily_distribution[day_of_week]
  5. DETERMINE peak_creation_hour and peak_creation_day from distributions
  6. RETURN temporal_analysis with:
     - hourly and daily distributions
     - peak creation times
     - total duplicates analyzed count
OUTPUT: temporal_duplicate_patterns
```
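A minimal Python sketch of the temporal analysis, assuming `CreatedDate` values are ISO 8601 strings as returned by the Salesforce REST API and that each group exposes its records under a `duplicates` key (an assumption carried over from the grouping sketch above).

```python
from collections import Counter
from datetime import datetime

def analyze_temporal_patterns(duplicate_groups):
    """Summarize when duplicate records were created."""
    hourly, daily = Counter(), Counter()
    total = 0
    for group in duplicate_groups:
        for record in group["duplicates"]:
            created = record.get("CreatedDate")
            if not created:
                continue
            # Salesforce timestamps look like '2024-05-01T09:15:30.000+0000'.
            dt = datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S")
            hourly[dt.hour] += 1
            daily[dt.strftime("%A")] += 1
            total += 1
    return {
        "hourly_distribution": dict(hourly),
        "daily_distribution": dict(daily),
        "peak_creation_hour": hourly.most_common(1)[0][0] if hourly else None,
        "peak_creation_day": daily.most_common(1)[0][0] if daily else None,
        "total_duplicates_analyzed": total,
    }
```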
```
Algorithm: Root Cause Analysis Generation
INPUT: pattern_analysis
PROCESS:
  1. INITIALIZE root_causes = empty list
  2. ANALYZE temporal patterns:
       IF peak_creation_hour in [9, 10, 11] (morning hours) THEN
         ADD root_cause:
           - category: "Process Issue"
           - cause: "Morning data entry rush"
           - evidence: peak hour information
           - recommendation: real-time prevention during peak hours
  3. ANALYZE user patterns:
       IF high_duplicate_users count > 0 THEN
         ADD root_cause:
           - category: "Training Issue"
           - cause: "Specific users creating many duplicates"
           - evidence: user count information
           - recommendation: targeted training
  4. RETURN complete root_causes analysis
OUTPUT: root_cause_analysis_report
```
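A minimal sketch of how these heuristics might be encoded; the two rules mirror the pseudocode above, while the dictionary keys (`temporal_patterns`, `high_duplicate_users`, and so on) are illustrative assumptions about the pattern-analysis output.

```python
def generate_root_causes(pattern_analysis):
    """Translate observed duplicate patterns into candidate root causes."""
    root_causes = []
    peak_hour = pattern_analysis.get("temporal_patterns", {}).get("peak_creation_hour")
    if peak_hour in (9, 10, 11):
        root_causes.append({
            "category": "Process Issue",
            "cause": "Morning data entry rush",
            "evidence": f"Peak duplicate creation at hour {peak_hour}",
            "recommendation": "Enable real-time duplicate prevention during peak hours",
        })
    high_dup_users = pattern_analysis.get("user_patterns", {}).get("high_duplicate_users", [])
    if high_dup_users:
        root_causes.append({
            "category": "Training Issue",
            "cause": "Specific users creating many duplicates",
            "evidence": f"{len(high_dup_users)} users exceed the duplicate threshold",
            "recommendation": "Provide targeted training for the affected users",
        })
    return root_causes
```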
### Step 3: Prevention and Management
**Automated Prevention Rules**
```
Algorithm: Salesforce Duplicate Rules Creation
INPUT: analysis_results
PROCESS:
  1. INITIALIZE duplicate_rules = empty list
  2. FOR each object_name, results in analysis_results:
     a. IDENTIFY best_algorithm from algorithm_results
     b. CREATE matching_rule configuration:
        - sobjectType: "MatchingRule"
        - DeveloperName: object_name + "_Duplicate_Rule"
        - MasterLabel: object_name + " Duplicate Detection Rule"
        - SobjectType: object_name
        - MatchingRuleItems from best_algorithm and results
     c. CREATE duplicate_rule configuration:
        - sobjectType: "DuplicateRule"
        - DeveloperName: object_name + "_Duplicate_Prevention"
        - MasterLabel: object_name + " Duplicate Prevention"
        - SobjectType: object_name
        - ActionOnInsert: "Block"
        - ActionOnUpdate: "Allow"
        - AlertText: "Potential duplicate record detected"
        - MatchingRule: matching_rule DeveloperName
     d. ADD rule_pair to duplicate_rules
  3. RETURN complete duplicate_rules
OUTPUT: salesforce_duplicate_rules_configuration
```
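A minimal Python sketch that assembles the rule pair described above as plain dictionaries. The field names mirror the pseudocode; the best-algorithm selection is illustrative, and how the payloads are actually deployed (Metadata API, change set, etc.) is out of scope and should be verified against the org's tooling.

```python
def build_duplicate_rules(analysis_results):
    """Build matching-rule / duplicate-rule configuration pairs per object."""
    duplicate_rules = []
    for object_name, results in analysis_results.items():
        # Illustrative selection: prefer the algorithm that found the most groups.
        best_algorithm = max(
            results["algorithm_results"],
            key=lambda a: len(results["algorithm_results"][a].get("duplicate_groups", [])),
        )
        matching_rule = {
            "sobjectType": "MatchingRule",
            "DeveloperName": f"{object_name}_Duplicate_Rule",
            "MasterLabel": f"{object_name} Duplicate Detection Rule",
            "SobjectType": object_name,
            "MatchingRuleItems": results["algorithm_results"][best_algorithm].get("match_fields", []),
        }
        duplicate_rule = {
            "sobjectType": "DuplicateRule",
            "DeveloperName": f"{object_name}_Duplicate_Prevention",
            "MasterLabel": f"{object_name} Duplicate Prevention",
            "SobjectType": object_name,
            "ActionOnInsert": "Block",
            "ActionOnUpdate": "Allow",
            "AlertText": "Potential duplicate record detected",
            "MatchingRule": matching_rule["DeveloperName"],
        }
        duplicate_rules.append({"matching_rule": matching_rule, "duplicate_rule": duplicate_rule})
    return duplicate_rules
```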
```
Algorithm: Custom Prevention Logic Implementation
INPUT: prevention_configuration
PROCESS:
  1. GENERATE apex_trigger code:
     - CREATE trigger on prevention_config object
     - SET trigger events: before insert, before update
     - CALL DuplicatePreventionHandler.handleDuplicatePrevention
  2. GENERATE apex_handler code:
     - CREATE public class DuplicatePreventionHandler
     - IMPLEMENT handleDuplicatePrevention method:
         FOR each record in newRecords:
           FIND potential_duplicates for record
           IF duplicates found THEN
             ADD error to record with duplicate message
     - IMPLEMENT findPotentialDuplicates method (custom matching logic)
     - IMPLEMENT buildDuplicateMessage method (user-friendly message)
  3. RETURN code_package with:
     - trigger_code
     - handler_code
OUTPUT: custom_duplicate_prevention_apex_code
```
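A minimal sketch of the code-generation step in Python. The generated trigger skeleton follows the structure described above; the handler body (matching logic and message building) is left as placeholder comments to be filled in per object, and the `object_name` configuration key is an illustrative assumption.

```python
def generate_prevention_code(prevention_config):
    """Render an Apex trigger skeleton and handler stub for the configured object."""
    obj = prevention_config["object_name"]  # e.g. "Account"; key name is illustrative
    trigger_code = f"""trigger {obj}DuplicatePrevention on {obj} (before insert, before update) {{
    DuplicatePreventionHandler.handleDuplicatePrevention(Trigger.new);
}}"""
    handler_code = """public class DuplicatePreventionHandler {
    public static void handleDuplicatePrevention(List<SObject> newRecords) {
        for (SObject record : newRecords) {
            // TODO: findPotentialDuplicates(record) - custom matching logic
            // TODO: record.addError(buildDuplicateMessage(...)) when duplicates are found
        }
    }
}"""
    return {"trigger_code": trigger_code, "handler_code": handler_code}
```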
## Success Criteria
- ✅ Comprehensive duplicate detection implemented
- ✅ Pattern analysis completed and documented
- ✅ Root cause analysis generated
- ✅ Prevention rules configured and active
- ✅ Automated monitoring established
- ✅ Data cleansing procedures operational
- ✅ User training materials created
- ✅ Ongoing duplicate management process established