# Duplicate Analysis Task

This task guides the comprehensive detection, analysis, and resolution of duplicate records in Salesforce to maintain data quality and integrity.

## Purpose

Enable data quality analysts to:

- Detect duplicate records using advanced algorithms
- Analyze duplicate patterns and root causes
- Implement prevention strategies
- Develop automated duplicate management processes
- Maintain ongoing duplicate monitoring and resolution

## Prerequisites

- Access to Salesforce data and duplicate management tools
- Understanding of fuzzy matching algorithms
- Knowledge of business processes and data entry patterns
- Duplicate rule configuration permissions
- Familiarity with data cleansing tools and techniques

## Duplicate Analysis Framework

### 1. Detection Methodology

**Multi-Level Duplicate Detection**

```yaml
Detection_Levels:
  Exact_Match:
    Criteria: Field values match exactly
    Use_Cases: Simple data entry errors
    Confidence: 100%
    Processing: Fast, automated
  Fuzzy_Match:
    Criteria: Field values similar within threshold
    Use_Cases: Variations in spelling, formatting
    Confidence: 80-95%
    Processing: Algorithm-based matching
  Probabilistic_Match:
    Criteria: Multiple fields contribute to match score
    Use_Cases: Complex duplicate scenarios
    Confidence: 60-90%
    Processing: Machine learning models
  Business_Rule_Match:
    Criteria: Custom business logic
    Use_Cases: Industry-specific duplicates
    Confidence: Variable
    Processing: Rule engine evaluation
```

**Detection Algorithms**

```
Algorithm: Duplicate Detection Engine Initialization
INPUT: configuration
PROCESS:
  1. SET similarity_threshold from config (default: 0.85)
  2. DEFINE available algorithms:
     - exact: exact field matching
     - fuzzy: fuzzy string matching
     - levenshtein: edit distance calculation
     - phonetic: sound-based matching
     - token: token-based comparison
  3. INITIALIZE detection engine with algorithms
OUTPUT: configured_duplicate_detection_engine
```

```
Algorithm: Duplicate Detection using Specified Algorithm
INPUT: records, match_fields, algorithm_type (default: 'fuzzy')
PROCESS:
  1. VALIDATE algorithm_type exists in available algorithms
  2. IF algorithm_type not found THEN RAISE ValueError with algorithm name
  3. SELECT detection_function for algorithm_type
  4. EXECUTE detection_function with records and match_fields
  5. FORMAT duplicate_results with algorithm information
  6. RETURN formatted duplicate results
OUTPUT: duplicate_detection_results
```

```
Algorithm: Fuzzy String Matching for Duplicate Detection
INPUT: records, match_fields
PROCESS:
  1. INITIALIZE duplicate_groups = empty list, processed_ids = empty set
  2. FOR each record1 in records:
     a. IF record1.Id already in processed_ids THEN continue
     b. INITIALIZE potential_duplicates = [record1]
     c. FOR each record2 in remaining records:
        - IF record2.Id in processed_ids THEN continue
        - CALCULATE similarity_score between record1 and record2
        - IF similarity_score >= similarity_threshold THEN
            ADD record2 to potential_duplicates
            ADD record2.Id to processed_ids
     d. IF potential_duplicates count > 1 THEN
          CREATE duplicate_group with:
            - master_candidate selection
            - all duplicate records
            - confidence_score calculation
            - match_reasons analysis
          ADD duplicate_group to duplicate_groups
     e. ADD record1.Id to processed_ids
  3. RETURN duplicate_groups
OUTPUT: fuzzy_match_duplicate_groups
```

```
Algorithm: Record Similarity Calculation
INPUT: record1, record2, fields_to_compare
PROCESS:
  1. DEFINE field_weights: Name=0.4, Email=0.3, Phone=0.2, Website=0.1
  2. INITIALIZE total_weight=0, weighted_similarity=0
  3. FOR each field in fields_to_compare:
     a. IF field has defined weight THEN
        - GET value1 from record1[field], normalize (strip, lowercase)
        - GET value2 from record2[field], normalize (strip, lowercase)
        - IF both values exist THEN
            CALCULATE similarity using fuzzy string matching
            GET field_weight for field
            ADD (similarity * field_weight) to weighted_similarity
            ADD field_weight to total_weight
  4. IF total_weight > 0 THEN RETURN weighted_similarity / total_weight
  5. ELSE RETURN 0
OUTPUT: similarity_score (0.0 to 1.0)
```
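As a concrete illustration of the weighted similarity calculation above, the following Python sketch uses the standard library's `difflib.SequenceMatcher` as a stand-in fuzzy matcher. The field weights and the 0.85 threshold mirror the pseudocode; the function names and record-dictionary shapes are illustrative assumptions, not part of the framework.

```python
from difflib import SequenceMatcher

# Field weights from the pseudocode above; unlisted fields are ignored.
FIELD_WEIGHTS = {"Name": 0.4, "Email": 0.3, "Phone": 0.2, "Website": 0.1}
SIMILARITY_THRESHOLD = 0.85  # default threshold from the detection engine config


def fuzzy_ratio(a: str, b: str) -> float:
    """Fuzzy string similarity in [0.0, 1.0]; difflib stands in for a fuzzy-match library."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(record1: dict, record2: dict, fields_to_compare: list[str]) -> float:
    """Weighted, normalized similarity between two records (0.0 to 1.0)."""
    total_weight = 0.0
    weighted_similarity = 0.0
    for field in fields_to_compare:
        weight = FIELD_WEIGHTS.get(field)
        if weight is None:
            continue
        value1 = str(record1.get(field) or "").strip().lower()
        value2 = str(record2.get(field) or "").strip().lower()
        if value1 and value2:
            weighted_similarity += fuzzy_ratio(value1, value2) * weight
            total_weight += weight
    return weighted_similarity / total_weight if total_weight > 0 else 0.0


if __name__ == "__main__":
    a = {"Name": "Acme Corporation", "Email": "info@acme.com", "Phone": "555-0100"}
    b = {"Name": "ACME Corp.", "Email": "info@acme.com", "Phone": "555-0100"}
    score = record_similarity(a, b, ["Name", "Email", "Phone", "Website"])
    print(f"similarity={score:.2f}, duplicate={score >= SIMILARITY_THRESHOLD}")
```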
### 2. Advanced Detection Techniques

**Machine Learning Duplicate Detection**

```
Algorithm: ML Duplicate Detector Initialization
INPUT: configuration
PROCESS:
  1. INITIALIZE text_vectorizer with n-gram range (1,2)
  2. CONFIGURE clustering_model with density-based clustering
     - eps = 0.3 (neighborhood distance)
     - min_samples = 2 (minimum cluster size)
  3. RETURN configured ML detector
OUTPUT: ml_duplicate_detection_system
```

```
Algorithm: Text Feature Preparation for ML Analysis
INPUT: records, text_fields
PROCESS:
  1. INITIALIZE feature_matrix = empty list, record_ids = empty list
  2. FOR each record in records:
     a. COMBINE text fields into single feature string:
        - FOR each field in text_fields:
            GET field value from record, normalize (strip, lowercase)
        - JOIN all field values with spaces
     b. ADD combined_text to feature_matrix
     c. ADD record.Id to record_ids
  3. VECTORIZE feature_matrix using TF-IDF transformation
  4. RETURN tfidf_matrix and record_ids
OUTPUT: vectorized_features_and_record_mapping
```

```
Algorithm: Clustering-based Duplicate Group Detection
INPUT: records, text_fields
PROCESS:
  1. PREPARE vectorized features from records and text_fields
  2. APPLY density-based clustering to tfidf_matrix
  3. GET cluster_labels for each record
  4. INITIALIZE clusters = empty dictionary
  5. FOR each record_index, cluster_label in cluster_labels:
     a. IF cluster_label != -1 (not noise) THEN
          IF cluster_label not in clusters THEN
            CREATE empty list for cluster_label
          ADD record information to clusters[cluster_label]
  6. CREATE duplicate_groups from clusters:
     FOR each cluster_id, cluster_records in clusters:
       IF cluster has more than 1 record THEN
         CREATE duplicate_group with:
           - cluster_id
           - all records in cluster
           - similarity_scores calculation
         ADD to duplicate_groups
  7. RETURN duplicate_groups
OUTPUT: ml_detected_duplicate_clusters
```
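The clustering approach sketched above maps naturally onto scikit-learn. The snippet below is a minimal sketch under that assumption, using `TfidfVectorizer` with the (1, 2) n-gram range and `DBSCAN` with eps = 0.3 and min_samples = 2 from the initialization pseudocode; the record shape and field names are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_duplicates(records: list[dict], text_fields: list[str]) -> list[dict]:
    """Group likely duplicates by TF-IDF vectorizing text fields and density clustering."""
    # Combine the configured text fields into one normalized string per record.
    texts = [
        " ".join(str(r.get(f) or "").strip().lower() for f in text_fields)
        for r in records
    ]
    tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

    # Cosine distance suits sparse TF-IDF vectors; DBSCAN labels noise points as -1.
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(tfidf)

    clusters: dict[int, list[dict]] = {}
    for record, label in zip(records, labels):
        if label != -1:
            clusters.setdefault(label, []).append(record)

    # Keep only clusters that actually contain more than one record.
    return [
        {"cluster_id": int(cid), "records": members}
        for cid, members in clusters.items()
        if len(members) > 1
    ]


if __name__ == "__main__":
    sample = [
        {"Id": "001A", "Name": "Acme Corporation", "Website": "acme.com"},
        {"Id": "001B", "Name": "Acme Corp", "Website": "www.acme.com"},
        {"Id": "001C", "Name": "Globex Industries", "Website": "globex.example"},
    ]
    print(cluster_duplicates(sample, ["Name", "Website"]))
```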
### 3. Business Rule Integration

**Custom Duplicate Rules**

```
Algorithm: Business Rule Application for Duplicate Detection
INPUT: record1, record2, rule_name, business_rules
PROCESS:
  1. IF rule_name not in business_rules THEN RETURN no_match_result with confidence 0
  2. GET rule configuration for rule_name
  3. INITIALIZE rule_results = empty list
  4. FOR each condition in rule conditions:
       EVALUATE condition between record1 and record2
       ADD evaluation result to rule_results
  5. COMBINE rule_results based on logic operator:
     a. IF rule logic = "AND" THEN
          overall_match = all individual matches
          overall_confidence = minimum confidence
     b. ELSE IF rule logic = "OR" THEN
          overall_match = any individual match
          overall_confidence = maximum confidence
  6. RETURN rule_evaluation_result with:
     - overall match status
     - overall confidence
     - rule_name
     - individual condition results
OUTPUT: business_rule_match_result
```

```
Algorithm: Individual Rule Condition Evaluation
INPUT: record1, record2, condition_definition
PROCESS:
  1. GET field1 = record1[condition.field]
  2. GET field2 = record2[condition.field]
  3. EVALUATE based on condition type:
     a. IF condition.type = "exact_match" THEN
          match = (field1 == field2)
          confidence = 1.0 if match else 0.0
     b. ELSE IF condition.type = "fuzzy_match" THEN
          IF both field1 and field2 exist THEN
            CALCULATE similarity using fuzzy string matching
            match = (similarity >= condition.threshold)
            confidence = similarity
          ELSE match = false, confidence = 0.0
     c. ELSE IF condition.type = "range_match" THEN
          IF both field1 and field2 are numeric THEN
            TRY:
              CONVERT to numbers
              CALCULATE diff_percent = |num1 - num2| / max(num1, num2) * 100
              match = (diff_percent <= condition.tolerance_percent)
              confidence = max(0, 1 - diff_percent / 100)
            CATCH conversion errors:
              match = false, confidence = 0.0
          ELSE match = false, confidence = 0.0
  4. RETURN condition_evaluation_result with:
     - condition name, match status, confidence
     - field name and both values
OUTPUT: condition_evaluation_result
```

## Implementation Steps

### Step 1: Duplicate Analysis Setup

**Analysis Configuration**

```yaml
duplicate_analysis_config:
  objects_to_analyze:
    Account:
      match_fields: [Name, Website, Phone, BillingPostalCode]
      algorithms: [fuzzy, exact, business_rule]
      confidence_threshold: 0.8
    Contact:
      match_fields: [FirstName, LastName, Email, Phone]
      algorithms: [fuzzy, phonetic]
      confidence_threshold: 0.85
    Lead:
      match_fields: [FirstName, LastName, Email, Company, Phone]
      algorithms: [fuzzy, ml_clustering]
      confidence_threshold: 0.75
  processing_options:
    batch_size: 1000
    parallel_processing: true
    max_workers: 4
    progress_reporting: true
```

**Duplicate Analysis Execution**

```
Algorithm: Comprehensive Duplicate Analysis Orchestration
INPUT: object_names (optional), configuration, salesforce_connection
PROCESS:
  1. IF object_names not provided THEN GET object_names from configuration objects_to_analyze
  2. INITIALIZE analysis_results = empty dictionary
  3. FOR each object_name in object_names:
     a. LOG analysis start for object_name
     b. GET object_configuration for object_name
     c. EXTRACT records for analysis using object_configuration
     d. INITIALIZE object_results = empty dictionary
     e. FOR each algorithm in object_configuration algorithms:
        - LOG algorithm execution
        - RUN duplicate detection with records, match_fields, algorithm
        - STORE results in object_results[algorithm]
     f. CONSOLIDATE algorithm results and rank findings
     g. COMPILE analysis_results[object_name] with:
        - total_records_analyzed count
        - duplicate_groups_found count
        - algorithm_results details
        - consolidated_results
        - analysis_metadata with timestamp and config
  4. RETURN complete analysis_results
OUTPUT: comprehensive_duplicate_analysis_report
```

```
Algorithm: Record Extraction for Duplicate Analysis
INPUT: object_name, object_configuration
PROCESS:
  1. COMBINE match_fields with standard fields [Id, CreatedDate, LastModifiedDate]
  2. REMOVE duplicates and CREATE field_list
  3. BUILD query: SELECT field_list FROM object_name
     WHERE IsDeleted = FALSE ORDER BY CreatedDate DESC
  4. EXECUTE query against Salesforce connection
  5. RETURN records from query result
OUTPUT: extracted_records_for_analysis
```
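To make the extraction step concrete, here is a small Python sketch that builds the SOQL query described above and runs it through a `simple_salesforce` connection. The choice of `simple_salesforce` and the helper names are assumptions for illustration, not part of the framework.

```python
from simple_salesforce import Salesforce

STANDARD_FIELDS = ["Id", "CreatedDate", "LastModifiedDate"]


def build_extraction_query(object_name: str, match_fields: list[str]) -> str:
    """Combine match fields with the standard fields and build the SOQL query."""
    # dict.fromkeys removes duplicate field names while preserving order.
    field_list = list(dict.fromkeys(match_fields + STANDARD_FIELDS))
    return (
        f"SELECT {', '.join(field_list)} "
        f"FROM {object_name} "
        "WHERE IsDeleted = false "
        "ORDER BY CreatedDate DESC"
    )


def extract_records(sf: Salesforce, object_name: str, match_fields: list[str]) -> list[dict]:
    """Run the extraction query and return plain record dictionaries."""
    soql = build_extraction_query(object_name, match_fields)
    result = sf.query_all(soql)  # query_all pages through the full result set
    # Drop the 'attributes' metadata block the REST API attaches to each record.
    return [{k: v for k, v in rec.items() if k != "attributes"} for rec in result["records"]]


# Example usage (credentials are placeholders):
# sf = Salesforce(username="user@example.com", password="...", security_token="...")
# accounts = extract_records(sf, "Account", ["Name", "Website", "Phone", "BillingPostalCode"])
```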
### Step 2: Pattern Analysis and Root Cause Identification

**Duplicate Pattern Analysis**

```
Algorithm: Comprehensive Duplicate Pattern Analysis
INPUT: duplicate_groups, object_name
PROCESS:
  1. CREATE pattern_analysis with components:
     - temporal_patterns from duplicate_groups
     - user_patterns from creation patterns
     - data_source_patterns analysis
     - field_variation_patterns analysis
     - similarity_distribution analysis
  2. RETURN complete pattern_analysis
OUTPUT: duplicate_pattern_analysis_report
```

```
Algorithm: Temporal Pattern Analysis for Duplicates
INPUT: duplicate_groups
PROCESS:
  1. INITIALIZE temporal_data = empty list
  2. FOR each group in duplicate_groups:
       FOR each record in group duplicates:
         IF record has CreatedDate THEN
           EXTRACT temporal information:
             - created_date, hour (from timestamp)
             - day_of_week (from date conversion)
             - group_id
           ADD to temporal_data
  3. INITIALIZE hourly_distribution = empty dict, daily_distribution = empty dict
  4. FOR each data_point in temporal_data:
       INCREMENT hourly_distribution[hour]
       INCREMENT daily_distribution[day_of_week]
  5. DETERMINE peak_creation_hour and peak_creation_day from distributions
  6. RETURN temporal_analysis with:
     - hourly and daily distributions
     - peak creation times
     - total duplicates analyzed count
OUTPUT: temporal_duplicate_patterns
```

```
Algorithm: Root Cause Analysis Generation
INPUT: pattern_analysis
PROCESS:
  1. INITIALIZE root_causes = empty list
  2. ANALYZE temporal patterns:
       IF peak_creation_hour in [9, 10, 11] (morning hours) THEN
         ADD root_cause:
           - category: "Process Issue"
           - cause: "Morning data entry rush"
           - evidence: peak hour information
           - recommendation: real-time prevention during peak hours
  3. ANALYZE user patterns:
       IF high_duplicate_users count > 0 THEN
         ADD root_cause:
           - category: "Training Issue"
           - cause: "Specific users creating many duplicates"
           - evidence: user count information
           - recommendation: targeted training
  4. RETURN complete root_causes analysis
OUTPUT: root_cause_analysis_report
```
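A minimal Python sketch of the temporal pattern analysis in this step, assuming Salesforce REST-style `CreatedDate` timestamps (e.g. `2024-03-05T09:15:00.000+0000`) and duplicate groups shaped as dictionaries with a `duplicates` list; the field names follow the pseudocode but are otherwise illustrative.

```python
from collections import Counter
from datetime import datetime


def analyze_temporal_patterns(duplicate_groups: list[dict]) -> dict:
    """Build hourly/daily distributions of duplicate creation times and find the peaks."""
    hourly = Counter()
    daily = Counter()
    total = 0

    for group in duplicate_groups:
        for record in group.get("duplicates", []):
            created = record.get("CreatedDate")
            if not created:
                continue
            # Salesforce REST timestamps look like 2024-03-05T09:15:00.000+0000
            dt = datetime.strptime(created, "%Y-%m-%dT%H:%M:%S.%f%z")
            hourly[dt.hour] += 1
            daily[dt.strftime("%A")] += 1
            total += 1

    return {
        "hourly_distribution": dict(hourly),
        "daily_distribution": dict(daily),
        "peak_creation_hour": max(hourly, key=hourly.get) if hourly else None,
        "peak_creation_day": max(daily, key=daily.get) if daily else None,
        "total_duplicates_analyzed": total,
    }
```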
### Step 3: Prevention and Management

**Automated Prevention Rules**

```
Algorithm: Salesforce Duplicate Rules Creation
INPUT: analysis_results
PROCESS:
  1. INITIALIZE duplicate_rules = empty list
  2. FOR each object_name, results in analysis_results:
     a. IDENTIFY best_algorithm from algorithm_results
     b. CREATE matching_rule configuration:
        - sobjectType: "MatchingRule"
        - DeveloperName: object_name + "_Duplicate_Rule"
        - MasterLabel: object_name + " Duplicate Detection Rule"
        - SobjectType: object_name
        - MatchingRuleItems from best_algorithm and results
     c. CREATE duplicate_rule configuration:
        - sobjectType: "DuplicateRule"
        - DeveloperName: object_name + "_Duplicate_Prevention"
        - MasterLabel: object_name + " Duplicate Prevention"
        - SobjectType: object_name
        - ActionOnInsert: "Block"
        - ActionOnUpdate: "Allow"
        - AlertText: "Potential duplicate record detected"
        - MatchingRule: matching_rule DeveloperName
     d. ADD rule_pair to duplicate_rules
  3. RETURN complete duplicate_rules
OUTPUT: salesforce_duplicate_rules_configuration
```

```
Algorithm: Custom Prevention Logic Implementation
INPUT: prevention_configuration
PROCESS:
  1. GENERATE apex_trigger code:
     - CREATE trigger on prevention_config object
     - SET trigger events: before insert, before update
     - CALL DuplicatePreventionHandler.handleDuplicatePrevention
  2. GENERATE apex_handler code:
     - CREATE public class DuplicatePreventionHandler
     - IMPLEMENT handleDuplicatePrevention method:
         FOR each record in newRecords:
           FIND potential_duplicates for record
           IF duplicates found THEN ADD error to record with duplicate message
     - IMPLEMENT findPotentialDuplicates method (custom matching logic)
     - IMPLEMENT buildDuplicateMessage method (user-friendly message)
  3. RETURN code_package with:
     - trigger_code
     - handler_code
OUTPUT: custom_duplicate_prevention_apex_code
```

## Success Criteria

- Comprehensive duplicate detection implemented
- Pattern analysis completed and documented
- Root cause analysis generated
- Prevention rules configured and active
- Automated monitoring established
- Data cleansing procedures operational
- User training materials created
- Ongoing duplicate management process established
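As a closing illustration of Step 3's "Salesforce Duplicate Rules Creation" algorithm, the Python sketch below assembles one MatchingRule/DuplicateRule configuration pair as plain dictionaries. Selecting the best algorithm, deriving real MatchingRuleItems, and deploying via the Metadata API are out of scope here; the helper name and item shape are illustrative assumptions.

```python
def build_duplicate_rule_pair(object_name: str, best_algorithm: str, match_fields: list[str]) -> dict:
    """Assemble the matching-rule / duplicate-rule configuration pair for one object."""
    matching_rule = {
        "sobjectType": "MatchingRule",
        "DeveloperName": f"{object_name}_Duplicate_Rule",
        "MasterLabel": f"{object_name} Duplicate Detection Rule",
        "SobjectType": object_name,
        # In a real deployment these items come from the best-performing algorithm's results;
        # the shape here is a simplified placeholder.
        "MatchingRuleItems": [
            {"FieldName": field, "MatchingMethod": best_algorithm} for field in match_fields
        ],
    }
    duplicate_rule = {
        "sobjectType": "DuplicateRule",
        "DeveloperName": f"{object_name}_Duplicate_Prevention",
        "MasterLabel": f"{object_name} Duplicate Prevention",
        "SobjectType": object_name,
        "ActionOnInsert": "Block",
        "ActionOnUpdate": "Allow",
        "AlertText": "Potential duplicate record detected",
        "MatchingRule": matching_rule["DeveloperName"],
    }
    return {"matching_rule": matching_rule, "duplicate_rule": duplicate_rule}


if __name__ == "__main__":
    pair = build_duplicate_rule_pair("Account", "fuzzy", ["Name", "Website", "Phone"])
    print(pair["duplicate_rule"]["DeveloperName"])  # Account_Duplicate_Prevention
```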