# Data Extraction Task

This task guides the systematic extraction of data from various sources for Salesforce migration, integration, and analysis purposes.

## Purpose

Enable ETL developers to:

- Design efficient data extraction processes
- Ensure data integrity and completeness
- Optimize extraction performance
- Handle complex data relationships
- Maintain security and compliance

## Prerequisites

- Access to source systems and databases
- Understanding of source data schemas
- Salesforce target data model knowledge
- Appropriate security credentials and permissions
- Data extraction tools and infrastructure

## Data Extraction Framework

### 1. Source System Analysis

**Data Source Assessment**

```yaml
Source_Systems:
  Salesforce_Org:
    Type: Salesforce Production/Sandbox
    Access: REST API, SOAP API, Bulk API
    Limits: API call limits, data storage
    Format: JSON, XML, CSV
  Database_Systems:
    Type: SQL Server, Oracle, MySQL, PostgreSQL
    Access: Direct connection, ODBC, JDBC
    Constraints: Connection limits, query timeouts
    Format: Tables, views, stored procedures
  File_Systems:
    Type: CSV, JSON, XML, Excel
    Access: FTP, SFTP, cloud storage
    Structure: Flat files, hierarchical
    Format: Delimited, fixed-width
  Web_Services:
    Type: REST APIs, SOAP services
    Authentication: OAuth, API keys, tokens
    Rate_Limits: Requests per minute/hour
    Format: JSON, XML responses
```

**Data Profiling and Discovery**

**Data Profiling Algorithms**

### Algorithm: Record Distribution Analysis

```
INPUT: objectName (e.g., "Account")
PROCESS:
1. INITIALIZE counters:
   - totalRecords = 0
   - uniqueValues = empty set
   - earliestDate = null
   - latestDate = null
2. FOR each record in object:
   - INCREMENT totalRecords
   - ADD field values to uniqueValues sets
   - UPDATE earliestDate if record.createdDate < earliestDate
   - UPDATE latestDate if record.createdDate > latestDate
3. CALCULATE statistics:
   - uniqueCount = size of uniqueValues set
   - dateRange = latestDate - earliestDate
4. RETURN profiling results:
   - Object name
   - Total record count
   - Unique value counts per field
   - Date range of records
```

### Algorithm: Data Quality Assessment

```
INPUT: objectName, fieldName, qualityRule
PROCESS:
1. INITIALIZE:
   - issueCount = 0
   - totalCount = 0
2. FOR each record in object:
   - INCREMENT totalCount
   - IF qualityRule(record.fieldName) fails THEN INCREMENT issueCount
3. CALCULATE percentage = (issueCount / totalCount) * 100
4. RETURN quality metrics:
   - Issue description
   - Absolute count
   - Percentage of total
```

### Algorithm: Relationship Complexity Analysis

```
INPUT: childObject, parentRelationship
PROCESS:
1. CREATE relationship map:
   - parentCounts = dictionary
   - relationshipGroups = dictionary
2. FOR each child record:
   - GET parentId from relationship
   - INCREMENT parentCounts[parentId]
   - ADD record to relationshipGroups[relationshipName]
3. CALCULATE metrics:
   - totalRelationships = sum of all counts
   - uniqueParents = count of keys in parentCounts
   - averageChildrenPerParent = totalRelationships / uniqueParents
4. RETURN complexity analysis:
   - Relationship name
   - Total record count
   - Unique parent count
   - Distribution statistics
```
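The profiling algorithms above can be implemented directly against a Salesforce org. The following is a minimal Python sketch using `simple_salesforce` (the same library configured in Step 1 below); the helper name `profile_object` and the profiled fields are illustrative assumptions rather than part of the framework.

```python
from collections import defaultdict

from simple_salesforce import Salesforce


def profile_object(sf: Salesforce, object_name: str, fields: list[str]) -> dict:
    """Record Distribution Analysis: total count, unique values per field, date range."""
    soql = f"SELECT {', '.join(fields)}, CreatedDate FROM {object_name}"
    total_records = 0
    unique_values = defaultdict(set)
    earliest_date = latest_date = None

    # query_all_iter streams results page by page instead of loading everything at once.
    for record in sf.query_all_iter(soql):
        total_records += 1
        for field in fields:
            unique_values[field].add(record.get(field))
        created = record.get("CreatedDate")  # ISO-8601 strings compare chronologically
        if created is not None:
            earliest_date = created if earliest_date is None else min(earliest_date, created)
            latest_date = created if latest_date is None else max(latest_date, created)

    return {
        "object": object_name,
        "total_records": total_records,
        "unique_counts": {field: len(values) for field, values in unique_values.items()},
        "date_range": (earliest_date, latest_date),
    }


# Example usage (credentials are placeholders):
# sf = Salesforce(username="user@company.com", password="...", security_token="...")
# print(profile_object(sf, "Account", ["Industry", "Type", "BillingCountry"]))
```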
### 2. Extraction Strategy Design

**Extraction Patterns**

```yaml
Full_Extract:
  Use_Cases:
    - Initial data migration
    - Complete system refresh
    - Data archival
  Considerations:
    - Large data volumes
    - Extended processing time
    - System impact

Incremental_Extract:
  Use_Cases:
    - Regular synchronization
    - Change data capture
    - Real-time updates
  Methods:
    - Timestamp-based
    - Sequence-based
    - Change log analysis

Delta_Extract:
  Use_Cases:
    - Modified records only
    - Efficient updates
    - Minimal system impact
  Tracking:
    - LastModifiedDate
    - SystemModstamp
    - Custom change flags
```

**Performance Optimization Strategy**

```json
{
  "extraction_optimization": {
    "bulk_operations": {
      "salesforce_bulk_api": "2.0",
      "batch_size": 10000,
      "parallel_processing": true,
      "compression": "gzip"
    },
    "query_optimization": {
      "selective_queries": true,
      "indexed_fields": "use in WHERE clauses",
      "limit_results": "paginate large datasets",
      "avoid_wildcards": "specify field lists"
    },
    "resource_management": {
      "connection_pooling": true,
      "memory_management": "stream processing",
      "error_handling": "retry logic",
      "logging": "detailed audit trail"
    }
  }
}
```

## Implementation Steps

### Step 1: Environment Setup and Configuration

**Connection Configuration**

```python
# Salesforce Connection Example
import simple_salesforce as sf
from simple_salesforce.bulk import SFBulkHandler

# Username/password + security token connection setup
sf_connection = sf.Salesforce(
    username='user@company.com',
    password='password',
    security_token='token',
    domain='test'  # for sandbox
)

# Bulk API handler for large data sets
bulk_handler = SFBulkHandler(
    session_id=sf_connection.session_id,
    bulk_url=sf_connection.bulk_url
)
```

**Database Connection Setup**

```python
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# SQL Server connection
connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=server_name;"
    "DATABASE=database_name;"
    "UID=username;"
    "PWD=password"
)
conn = pyodbc.connect(connection_string)

# PostgreSQL connection with SQLAlchemy
engine = create_engine('postgresql://user:password@localhost:5432/dbname')
```

### Step 2: Data Extraction Implementation

**Salesforce Data Extraction**

```
Algorithm: Salesforce Account Data Extraction
INPUT: connection_object, batch_size (default: 10000)
PROCESS:
1. DEFINE field_list = [Id, Name, Type, Industry, BillingStreet, BillingCity, BillingState,
   BillingPostalCode, BillingCountry, Phone, Website, AnnualRevenue, NumberOfEmployees,
   CreatedDate, LastModifiedDate]
2. INCLUDE related_data = [Contact records where IsDeleted = FALSE]
3. BUILD query with field_list and related_data
4. ESTIMATE total_record_count for Account object
5. IF total_record_count < 50000 THEN use standard SOQL query execution
6. ELSE use bulk_api_extraction method
7. RETURN extracted_data_set
OUTPUT: account_records_with_related_contacts
```

```
Algorithm: Bulk API Data Extraction
INPUT: object_name, query, batch_size
PROCESS:
1. CREATE bulk_extraction_job for object_name
2. SUBMIT query to bulk_api
3. INITIALIZE results = empty_collection
4. WHILE job has remaining batches:
   a. RETRIEVE next batch from job
   b. GET batch_results from batch
   c. APPEND batch_results to results
5. RETURN consolidated results
OUTPUT: complete_extracted_dataset
```
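Before moving on to database sources, here is a hedged Python sketch of the two Salesforce extraction algorithms above, using the `simple_salesforce` connection from Step 1. The 50,000-record threshold mirrors the Account algorithm; the helper name `extract_records` is an assumption.

```python
from simple_salesforce import Salesforce

BULK_THRESHOLD = 50_000  # switch to the Bulk API above this record count, per the algorithm above


def extract_records(sf: Salesforce, object_name: str, soql: str) -> list:
    """Extract all rows for `soql`, choosing the REST or Bulk API by data volume."""
    total = sf.query(f"SELECT COUNT() FROM {object_name}")["totalSize"]

    if total < BULK_THRESHOLD:
        # Standard SOQL; query_all transparently follows nextRecordsUrl pagination.
        return sf.query_all(soql)["records"]

    # Bulk query: simple_salesforce creates the job, polls it, and returns the results.
    return getattr(sf.bulk, object_name).query(soql)


# Example usage (connection setup as in Step 1):
# accounts = extract_records(
#     sf_connection, "Account",
#     "SELECT Id, Name, Industry, AnnualRevenue FROM Account"
# )
```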
**Database Extraction with Change Detection**

```
Algorithm: Incremental Database Extraction
INPUT: table_name, timestamp_field, last_extraction_time (optional)
PROCESS:
1. BUILD base_query = "SELECT * FROM " + table_name
2. IF last_extraction_time is provided THEN
   a. ADD WHERE clause: timestamp_field > last_extraction_time
   b. ADD ORDER BY timestamp_field
3. ELSE ADD ORDER BY timestamp_field only
4. EXECUTE query against database connection
5. RETURN query_results as structured dataset
OUTPUT: extracted_records_since_last_extraction
```

```
Algorithm: Database Extraction with Integrity Verification
INPUT: table_name, key_field
PROCESS:
1. BUILD query with all table fields
2. ADD checksum calculation for row integrity
3. ADD ORDER BY key_field for consistent results
4. EXECUTE query against database connection
5. CALCULATE row_checksum for each record
6. RETURN dataset with original_data and checksum_verification
OUTPUT: records_with_integrity_checksums
```
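As a concrete counterpart to the two database algorithms above, the sketch below combines timestamp-based change detection with a per-row checksum, using the SQLAlchemy engine and pandas imports from Step 1. The table name, watermark handling, and helper name `extract_incremental` are illustrative assumptions.

```python
import hashlib

import pandas as pd
from sqlalchemy import create_engine, text


def extract_incremental(engine, table_name: str, timestamp_field: str,
                        last_extraction_time=None) -> pd.DataFrame:
    """Pull rows changed since the last run and attach a per-row checksum."""
    query = f"SELECT * FROM {table_name}"
    params = {}
    if last_extraction_time is not None:
        query += f" WHERE {timestamp_field} > :since"
        params["since"] = last_extraction_time
    query += f" ORDER BY {timestamp_field}"

    df = pd.read_sql(text(query), engine, params=params)

    # Row checksum for downstream integrity verification: MD5 over the
    # pipe-joined string form of every column value.
    df["row_checksum"] = df.astype(str).agg("|".join, axis=1).map(
        lambda row: hashlib.md5(row.encode("utf-8")).hexdigest()
    )
    return df


# Example usage (connection string and watermark are placeholders):
# engine = create_engine("postgresql://user:password@localhost:5432/dbname")
# changed = extract_incremental(engine, "customers", "updated_at",
#                               last_extraction_time="2024-01-01 00:00:00")
```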
### Step 3: Data Relationship Handling

**Hierarchical Data Extraction**

```
Algorithm: Account Hierarchy Extraction
INPUT: salesforce_connection
PROCESS:
1. DEFINE field_list = [Id, Name, ParentId, Type, Industry, CreatedDate, LastModifiedDate]
2. BUILD query to extract all non-deleted accounts
3. ORDER results by ParentId (nulls first), then by Name
4. EXECUTE query and retrieve all account records
5. INITIALIZE hierarchy structure:
   - root_accounts = empty list
   - child_accounts = empty dictionary
   - orphaned_accounts = empty list
6. FOR each account in extracted records:
   a. IF account.ParentId is null THEN ADD account to root_accounts
   b. ELSE:
      - IF parent_id not in child_accounts THEN CREATE new list for parent_id
      - ADD account to child_accounts[parent_id]
7. RETURN structured hierarchy
OUTPUT: hierarchical_account_structure
```

**Related Object Extraction**

```
Algorithm: Opportunity Ecosystem Extraction
INPUT: account_ids_list
PROCESS:
1. INITIALIZE opportunities = empty collection
2. FOR each account_id in account_ids_list:
   a. DEFINE opportunity_fields = [Id, Name, StageName, Amount, CloseDate, AccountId]
   b. DEFINE related_line_items = [Id, PricebookEntry.Product2.Name, Quantity, UnitPrice, TotalPrice]
   c. DEFINE related_tasks = [Id, Subject, ActivityDate, Status, WhoId WHERE IsClosed = FALSE]
   d. DEFINE related_events = [Id, Subject, ActivityDateTime, WhoId WHERE ActivityDateTime >= TODAY]
   e. BUILD query including opportunity_fields and all related objects
   f. ADD filter: AccountId = current account_id AND IsDeleted = FALSE
   g. EXECUTE query for current account
   h. APPEND results to opportunities collection
3. RETURN complete opportunities dataset
OUTPUT: opportunities_with_related_objects
```

## Advanced Extraction Techniques

### Real-time Data Extraction

```
Algorithm: Streaming API Setup for Real-time Updates
INPUT: object_name, fields_list
PROCESS:
1. CREATE push_topic_configuration:
   - Name = object_name + "_Updates"
   - Query = "SELECT " + join(fields_list) + " FROM " + object_name
   - ApiVersion = current_api_version
   - NotifyForOperationCreate = true
   - NotifyForOperationUpdate = true
   - NotifyForOperationDelete = true
   - NotifyForFields = "All"
2. SUBMIT push_topic_configuration to Salesforce
3. RECEIVE push_topic_id from creation response
4. RETURN push_topic_id for subscription
OUTPUT: streaming_topic_identifier
```

```
Algorithm: Real-time Change Listener
INPUT: push_topic_name, callback_function
PROCESS:
1. INITIALIZE streaming_client with:
   - session_id from salesforce_connection
   - instance_url from salesforce_connection
2. SUBSCRIBE to streaming topic: "/topic/" + push_topic_name
3. REGISTER callback_function for change notifications
4. START streaming client listener
5. CONTINUOUSLY process incoming change events
6. FOR each change event: EXECUTE callback_function with event_data
OUTPUT: continuous_real_time_monitoring
```

### Large Volume Data Handling

```
Algorithm: Large Dataset Extraction with Pagination
INPUT: object_name, query, chunk_size (default: 50000)
PROCESS:
1. INITIALIZE all_records = empty collection
2. MODIFY query to include LIMIT chunk_size
3. EXECUTE initial query
4. ADD initial results to all_records
5. WHILE query result indicates more records available:
   a. GET next_records_url from previous result
   b. EXECUTE query_more with next_records_url
   c. ADD new results to all_records
   d. IF all_records count is a multiple of (chunk_size * 5) THEN save intermediate results for recovery
6. RETURN complete all_records collection
OUTPUT: complete_large_dataset
```

```
Algorithm: Parallel Multi-Query Extraction
INPUT: queries_dictionary, max_workers (default: 5)
PROCESS:
1. INITIALIZE results = empty dictionary
2. CREATE thread_pool with max_workers threads
3. FOR each query_name and query in queries_dictionary:
   SUBMIT query execution to thread_pool
4. COLLECT completed futures as they finish:
   a. GET query_name for completed future
   b. TRY to get future result
   c. IF successful THEN SET results[query_name] = query_result
   d. ELSE SET results[query_name] = error_information
5. RETURN results dictionary with all query outcomes
OUTPUT: parallel_query_results
```

## Data Quality and Validation

### Extraction Validation Framework

```
Algorithm: Record Count Validation
INPUT: source_count, extracted_count, tolerance (default: 0.01)
PROCESS:
1. CALCULATE variance = |source_count - extracted_count| / source_count
2. CALCULATE variance_percentage = variance * 100
3. DETERMINE within_tolerance = (variance <= tolerance)
4. IF within_tolerance THEN SET status = "PASS"
5. ELSE SET status = "FAIL"
6. CREATE validation_result with all metrics
7. RETURN validation_result
OUTPUT: count_validation_report
```

```
Algorithm: Data Integrity Validation
INPUT: extracted_data, key_field
PROCESS:
1. CONVERT extracted_data to structured format
2. COUNT total_records in dataset
3. COUNT unique_keys in key_field
4. CALCULATE duplicate_count = total_records - unique_keys
5. COUNT null_keys in key_field
6. VERIFY data_types_consistent across records
7. COMPILE integrity_metrics
8. RETURN integrity_validation_report
OUTPUT: data_integrity_assessment
```

```
Algorithm: Referential Integrity Validation
INPUT: parent_data, child_data, parent_key, foreign_key
PROCESS:
1. EXTRACT parent_ids from parent_data using parent_key
2. EXTRACT foreign_keys from child_data using foreign_key (exclude nulls)
3. IDENTIFY orphaned_records = foreign_keys NOT IN parent_ids
4. COUNT parent_record_count, child_record_count, orphaned_child_count
5. IF child_record_count > 0 THEN CALCULATE integrity_percentage = (1 - orphaned_child_count / child_record_count) * 100
6. ELSE SET integrity_percentage = 100
7. COMPILE referential_integrity_report
8. RETURN validation_results
OUTPUT: referential_integrity_assessment
```
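The validation algorithms above translate into a few lines of plain Python. A minimal sketch, assuming records are represented as lists of dictionaries; the function names and the 1% default tolerance mirror the algorithms but are otherwise assumptions.

```python
def validate_record_count(source_count: int, extracted_count: int,
                          tolerance: float = 0.01) -> dict:
    """Flag the extraction if source and extracted counts diverge by more than `tolerance`."""
    variance = abs(source_count - extracted_count) / source_count if source_count else 0.0
    return {
        "source_count": source_count,
        "extracted_count": extracted_count,
        "variance_percentage": variance * 100,
        "status": "PASS" if variance <= tolerance else "FAIL",
    }


def validate_referential_integrity(parent_records: list[dict], child_records: list[dict],
                                   parent_key: str, foreign_key: str) -> dict:
    """Report child rows whose foreign key has no matching parent record."""
    parent_ids = {p[parent_key] for p in parent_records}
    foreign_keys = [c[foreign_key] for c in child_records if c.get(foreign_key) is not None]
    orphaned = [fk for fk in foreign_keys if fk not in parent_ids]
    integrity_pct = 100.0 if not child_records else (1 - len(orphaned) / len(child_records)) * 100
    return {
        "parent_record_count": len(parent_records),
        "child_record_count": len(child_records),
        "orphaned_child_count": len(orphaned),
        "integrity_percentage": integrity_pct,
    }


# Example usage (counts and record lists are placeholders):
# print(validate_record_count(source_count=120000, extracted_count=119450))
# print(validate_referential_integrity(accounts, contacts, "Id", "AccountId"))
```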
## Error Handling and Recovery

### Robust Extraction Pipeline

```
Algorithm: Extraction with Retry Logic
INPUT: extraction_function, function_parameters, retry_attempts (default: 3), backoff_factor (default: 2)
PROCESS:
1. FOR attempt = 1 to retry_attempts:
   a. TRY to execute extraction_function with function_parameters
   b. IF successful THEN RETURN extraction_result
   c. IF exception occurs THEN
      - IF attempt = retry_attempts THEN RAISE final exception
      - ELSE:
        - CALCULATE wait_time = backoff_factor ^ (attempt - 1)
        - LOG failure message and retry information
        - WAIT for wait_time seconds
2. IF all attempts fail THEN RAISE extraction_failure_exception
OUTPUT: successful_extraction_result OR exception
```

```
Algorithm: Extraction Progress Checkpointing
INPUT: extracted_data, checkpoint_id
PROCESS:
1. CREATE checkpoint_metadata:
   - checkpoint_id = provided identifier
   - timestamp = current datetime
   - record_count = count of extracted_data
   - data = extracted_data
2. SERIALIZE checkpoint_metadata to storage format
3. SAVE to file: "checkpoint_" + checkpoint_id + ".json"
4. CONFIRM successful save operation
OUTPUT: checkpoint_saved_confirmation
```

```
Algorithm: Resume from Checkpoint
INPUT: checkpoint_id
PROCESS:
1. CONSTRUCT checkpoint_filename = "checkpoint_" + checkpoint_id + ".json"
2. TRY to read checkpoint_filename
3. IF file exists THEN
   a. DESERIALIZE checkpoint_data from file
   b. RETURN checkpoint_data['data']
4. ELSE RETURN null (no checkpoint found)
OUTPUT: recovered_data OR null
```

A runnable sketch of the retry and checkpointing algorithms appears at the end of this document.

## Performance Monitoring and Optimization

### Extraction Performance Metrics

```
Algorithm: Extraction Performance Tracking
INPUT: object_name, start_time, end_time, record_count, data_size_mb
PROCESS:
1. CALCULATE duration_seconds = end_time - start_time
2. IF duration_seconds > 0 THEN
   a. CALCULATE records_per_second = record_count / duration_seconds
   b. CALCULATE mb_per_second = data_size_mb / duration_seconds
3. ELSE SET records_per_second = 0, mb_per_second = 0
4. CREATE performance_metrics:
   - object_name, duration_seconds, records_extracted
   - data_size_mb, records_per_second, mb_per_second
   - extraction_timestamp = start_time
5. STORE metrics for object_name
6. RETURN performance_metrics
OUTPUT: extraction_performance_data
```

```
Algorithm: Performance Report Generation
INPUT: collected_metrics_for_all_objects
PROCESS:
1. CALCULATE summary_statistics:
   - total_objects = count of metrics
   - total_records = sum of records_extracted across all objects
   - total_data_size_mb = sum of data_size_mb across all objects
   - average_records_per_second = mean of records_per_second values
2. COMPILE object_details from individual metrics
3. GENERATE optimization_recommendations based on performance patterns
4. CREATE comprehensive_report with:
   - summary_statistics
   - object_details
   - optimization_recommendations
5. RETURN performance_report
OUTPUT: comprehensive_performance_analysis
```

## Tools and Integration

### Supported Tools and Platforms

```yaml
ETL_Tools:
  Talend:
    - Salesforce connectors
    - Built-in data quality
    - Visual job design
  Informatica:
    - PowerCenter
    - Cloud Data Integration
    - Real-time processing
  MuleSoft:
    - Anypoint Platform
    - API-led connectivity
    - Real-time synchronization
  Custom_Solutions:
    - Python with pandas
    - Apache Airflow
    - AWS Glue
    - Azure Data Factory

Monitoring_Tools:
  - Salesforce Event Monitoring
  - Custom logging frameworks
  - APM solutions (New Relic, Datadog)
  - Database performance monitors
```

## Success Criteria

- Source systems analyzed and profiled
- Extraction strategy designed and documented
- Performance optimization implemented
- Data validation framework established
- Error handling and recovery mechanisms active
- Monitoring and alerting configured
- Documentation and runbooks completed
- Stakeholder sign-off obtained
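As a closing illustration of the Error Handling and Recovery section (the sketch referenced there), here is a minimal Python implementation of the retry-with-backoff and checkpointing algorithms. The function names, JSON checkpoint format, and file locations are illustrative assumptions.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def extract_with_retry(extraction_fn, *args, retry_attempts=3, backoff_factor=2, **kwargs):
    """Run extraction_fn, retrying with exponential backoff on failure."""
    for attempt in range(1, retry_attempts + 1):
        try:
            return extraction_fn(*args, **kwargs)
        except Exception as exc:
            if attempt == retry_attempts:
                raise  # all attempts exhausted: surface the final exception
            wait_time = backoff_factor ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait_time}s")
            time.sleep(wait_time)


def save_checkpoint(extracted_data, checkpoint_id, directory="."):
    """Persist extracted records plus metadata so a failed run can resume."""
    checkpoint = {
        "checkpoint_id": checkpoint_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_count": len(extracted_data),
        "data": extracted_data,  # assumed JSON-serializable (e.g. list of dicts)
    }
    path = Path(directory) / f"checkpoint_{checkpoint_id}.json"
    path.write_text(json.dumps(checkpoint), encoding="utf-8")
    return path


def resume_from_checkpoint(checkpoint_id, directory="."):
    """Return previously checkpointed data, or None if no checkpoint exists."""
    path = Path(directory) / f"checkpoint_{checkpoint_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))["data"]


# Example usage (extract_records refers to the hypothetical Step 2 sketch):
# records = extract_with_retry(extract_records, sf_connection, "Account",
#                              "SELECT Id, Name FROM Account")
# save_checkpoint(records, "account_batch_01")
```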