# Data Extraction Task
This task guides the systematic extraction of data from various sources for
Salesforce migration, integration, and analysis purposes.
## Purpose
Enable ETL developers to:
- Design efficient data extraction processes
- Ensure data integrity and completeness
- Optimize extraction performance
- Handle complex data relationships
- Maintain security and compliance
## Prerequisites
- Access to source systems and databases
- Understanding of source data schemas
- Salesforce target data model knowledge
- Appropriate security credentials and permissions
- Data extraction tools and infrastructure
## Data Extraction Framework
### 1. Source System Analysis
**Data Source Assessment**
```yaml
Source_Systems:
  Salesforce_Org:
    Type: Salesforce Production/Sandbox
    Access: REST API, SOAP API, Bulk API
    Limits: API call limits, data storage
    Format: JSON, XML, CSV
  Database_Systems:
    Type: SQL Server, Oracle, MySQL, PostgreSQL
    Access: Direct connection, ODBC, JDBC
    Constraints: Connection limits, query timeouts
    Format: Tables, views, stored procedures
  File_Systems:
    Type: CSV, JSON, XML, Excel
    Access: FTP, SFTP, cloud storage
    Structure: Flat files, hierarchical
    Format: Delimited, fixed-width
  Web_Services:
    Type: REST APIs, SOAP services
    Authentication: OAuth, API keys, tokens
    Rate_Limits: Requests per minute/hour
    Format: JSON, XML responses
```
**Data Profiling and Discovery**

The following profiling algorithms characterize each source before extraction begins.
### Algorithm: Record Distribution Analysis
```
INPUT: objectName (e.g., "Account")
PROCESS:
1. INITIALIZE counters:
- totalRecords = 0
- uniqueValues = empty set
- earliestDate = null
- latestDate = null
2. FOR each record in object:
- INCREMENT totalRecords
- ADD field values to uniqueValues sets
- UPDATE earliestDate if record.createdDate < earliestDate
- UPDATE latestDate if record.createdDate > latestDate
3. CALCULATE statistics:
- uniqueCount = size of uniqueValues set
- dateRange = latestDate - earliestDate
4. RETURN profiling results:
- Object name
- Total record count
- Unique value counts per field
- Date range of records
```
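As a minimal Python sketch of this profiling pass, assuming the `simple_salesforce` connection from the Implementation Steps, a caller-supplied field list, and modest record volumes (larger objects would go through the Bulk API covered later); `profile_object` is an illustrative helper, not a library function:

```python
from collections import defaultdict
from simple_salesforce import Salesforce

def profile_object(sf_connection: Salesforce, object_name: str, fields: list) -> dict:
    """Single-pass record distribution profile for one object."""
    field_list = list(dict.fromkeys(fields + ["CreatedDate"]))  # avoid duplicate fields in SOQL
    soql = f"SELECT {', '.join(field_list)} FROM {object_name}"

    unique_values = defaultdict(set)
    total, earliest, latest = 0, None, None
    for record in sf_connection.query_all(soql)["records"]:
        total += 1
        for field in fields:
            unique_values[field].add(record.get(field))
        created = record["CreatedDate"]  # ISO 8601 strings compare chronologically
        earliest = created if earliest is None or created < earliest else earliest
        latest = created if latest is None or created > latest else latest

    return {
        "object": object_name,
        "total_records": total,
        "unique_counts": {f: len(v) for f, v in unique_values.items()},
        "date_range": (earliest, latest),
    }
```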
### Algorithm: Data Quality Assessment
```
INPUT: objectName, fieldName, qualityRule
PROCESS:
1. INITIALIZE:
- issueCount = 0
- totalCount = 0
2. FOR each record in object:
- INCREMENT totalCount
- IF qualityRule(record.fieldName) fails THEN
INCREMENT issueCount
3. CALCULATE percentage = (issueCount / totalCount) * 100
4. RETURN quality metrics:
- Issue description
- Absolute count
- Percentage of total
```
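A short Python equivalent, assuming records have already been extracted into a list of dictionaries and the rule is supplied as a callable (`assess_field_quality` is illustrative only):

```python
from typing import Callable

def assess_field_quality(records: list, field_name: str,
                         quality_rule: Callable, issue_description: str) -> dict:
    """Count records whose field value fails the supplied quality rule."""
    total = len(records)
    issues = sum(1 for record in records if not quality_rule(record.get(field_name)))
    return {
        "issue": issue_description,
        "issue_count": issues,
        "total_count": total,
        "issue_percentage": (issues / total * 100) if total else 0.0,
    }

# Example: flag Accounts with a missing Industry value
# report = assess_field_quality(accounts, "Industry", lambda value: bool(value), "Missing Industry")
```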
### Algorithm: Relationship Complexity Analysis
```
INPUT: childObject, parentRelationship
PROCESS:
1. CREATE relationship map:
- parentCounts = dictionary
- relationshipGroups = dictionary
2. FOR each child record:
- GET parentId from relationship
- INCREMENT parentCounts[parentId]
- ADD record to relationshipGroups[relationshipName]
3. CALCULATE metrics:
- totalRelationships = sum of all counts
- uniqueParents = count of keys in parentCounts
- averageChildrenPerParent = totalRelationships / uniqueParents
4. RETURN complexity analysis:
- Relationship name
- Total record count
- Unique parent count
- Distribution statistics
```
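The same analysis in Python, assuming the child records are dictionaries carrying a parent lookup field (the default field name is a placeholder):

```python
from collections import Counter

def analyze_relationship(child_records: list, parent_field: str = "AccountId") -> dict:
    """Summarize how child records are distributed across their parents."""
    parent_counts = Counter(r[parent_field] for r in child_records if r.get(parent_field))
    total = sum(parent_counts.values())
    unique_parents = len(parent_counts)
    return {
        "relationship_field": parent_field,
        "total_relationships": total,
        "unique_parents": unique_parents,
        "avg_children_per_parent": total / unique_parents if unique_parents else 0,
        "max_children_for_one_parent": max(parent_counts.values(), default=0),
    }
```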
### 2. Extraction Strategy Design
**Extraction Patterns**
```yaml
Full_Extract:
  Use_Cases:
    - Initial data migration
    - Complete system refresh
    - Data archival
  Considerations:
    - Large data volumes
    - Extended processing time
    - System impact
Incremental_Extract:
  Use_Cases:
    - Regular synchronization
    - Change data capture
    - Real-time updates
  Methods:
    - Timestamp-based
    - Sequence-based
    - Change log analysis
Delta_Extract:
  Use_Cases:
    - Modified records only
    - Efficient updates
    - Minimal system impact
  Tracking:
    - LastModifiedDate
    - SystemModstamp
    - Custom change flags
```
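To illustrate the incremental and delta patterns, a watermark-based SOQL builder might look like the sketch below. `SystemModstamp` and `LastModifiedDate` are standard Salesforce audit fields; where and how the watermark is persisted is left open here.

```python
from datetime import datetime, timezone
from typing import List, Optional

def build_incremental_soql(object_name: str, fields: List[str],
                           last_run: Optional[datetime]) -> str:
    """Build a SOQL query that only pulls records changed since the last run."""
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"
    if last_run is not None:
        # SystemModstamp also captures system-driven changes, unlike LastModifiedDate
        watermark = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        soql += f" WHERE SystemModstamp > {watermark}"
    return soql + " ORDER BY SystemModstamp"
```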
**Performance Optimization Strategy**
```json
{
  "extraction_optimization": {
    "bulk_operations": {
      "salesforce_bulk_api": "2.0",
      "batch_size": 10000,
      "parallel_processing": true,
      "compression": "gzip"
    },
    "query_optimization": {
      "selective_queries": true,
      "indexed_fields": "use in WHERE clauses",
      "limit_results": "paginate large datasets",
      "avoid_wildcards": "specify field lists"
    },
    "resource_management": {
      "connection_pooling": true,
      "memory_management": "stream processing",
      "error_handling": "retry logic",
      "logging": "detailed audit trail"
    }
  }
}
```
## Implementation Steps
### Step 1: Environment Setup and Configuration
**Connection Configuration**
```python
# Salesforce connection example (username/password + security token login)
from simple_salesforce import Salesforce

sf_connection = Salesforce(
    username='user@company.com',
    password='password',
    security_token='token',
    domain='test'  # 'test' targets a sandbox; omit for production
)

# Bulk API access for large data sets via the built-in handler,
# e.g. sf_connection.bulk.Account.query("SELECT Id, Name FROM Account")
bulk_handler = sf_connection.bulk
```
**Database Connection Setup**
```python
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# SQL Server connection via ODBC
connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=server_name;"
    "DATABASE=database_name;"
    "UID=username;"
    "PWD=password"
)
conn = pyodbc.connect(connection_string)

# PostgreSQL connection with SQLAlchemy
engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# Quick sanity check: pull a sample of rows into a DataFrame (table name is a placeholder)
sample = pd.read_sql("SELECT * FROM source_table LIMIT 100", engine)
```
### Step 2: Data Extraction Implementation
**Salesforce Data Extraction**
```
Algorithm: Salesforce Account Data Extraction
INPUT: connection_object, batch_size (default: 10000)
PROCESS:
1. DEFINE field_list = [Id, Name, Type, Industry, BillingStreet, BillingCity,
BillingState, BillingPostalCode, BillingCountry, Phone,
Website, AnnualRevenue, NumberOfEmployees, CreatedDate,
LastModifiedDate]
2. INCLUDE related_data = [Contact records where IsDeleted = FALSE]
3. BUILD query with field_list and related_data
4. ESTIMATE total_record_count for Account object
5. IF total_record_count < 50000 THEN
use standard SOQL query execution
6. ELSE
use bulk_api_extraction method
7. RETURN extracted_data_set
OUTPUT: account_records_with_related_contacts
```
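A condensed Python version of this routing logic, assuming the `sf_connection` from Step 1, a simplified field list, and the 50,000-record threshold from the pseudocode (the related-contact subquery is omitted for brevity):

```python
ACCOUNT_FIELDS = ["Id", "Name", "Type", "Industry", "Phone", "Website",
                  "AnnualRevenue", "NumberOfEmployees", "CreatedDate", "LastModifiedDate"]

def extract_accounts(sf_connection, threshold: int = 50_000) -> list:
    """Use standard SOQL for small volumes and the Bulk API for large ones."""
    total = sf_connection.query("SELECT COUNT() FROM Account")["totalSize"]
    soql = f"SELECT {', '.join(ACCOUNT_FIELDS)} FROM Account"

    if total < threshold:
        return sf_connection.query_all(soql)["records"]
    # Bulk API query through simple_salesforce's built-in handler
    return sf_connection.bulk.Account.query(soql)
```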
```
Algorithm: Bulk API Data Extraction
INPUT: object_name, query, batch_size
PROCESS:
1. CREATE bulk_extraction_job for object_name
2. SUBMIT query to bulk_api
3. INITIALIZE results = empty_collection
4. WHILE job has remaining batches:
a. RETRIEVE next batch from job
b. GET batch_results from batch
c. APPEND batch_results to results
5. RETURN consolidated results
OUTPUT: complete_extracted_dataset
```
**Database Extraction with Change Detection**
```
Algorithm: Incremental Database Extraction
INPUT: table_name, timestamp_field, last_extraction_time (optional)
PROCESS:
1. BUILD base_query = "SELECT * FROM " + table_name
2. IF last_extraction_time is provided THEN
a. ADD WHERE clause: timestamp_field > last_extraction_time
b. ADD ORDER BY timestamp_field
3. ELSE
ADD ORDER BY timestamp_field only
4. EXECUTE query against database connection
5. RETURN query_results as structured dataset
OUTPUT: extracted_records_since_last_extraction
```
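A minimal sketch with pandas and the SQLAlchemy `engine` from Step 1; the table and timestamp column names are placeholders:

```python
import pandas as pd
from sqlalchemy import text

def extract_incremental(engine, table_name: str, timestamp_field: str,
                        last_extraction_time=None) -> pd.DataFrame:
    """Pull only rows changed since the last extraction (or everything on the first run)."""
    query = f"SELECT * FROM {table_name}"
    params = {}
    if last_extraction_time is not None:
        query += f" WHERE {timestamp_field} > :last_run"
        params["last_run"] = last_extraction_time
    query += f" ORDER BY {timestamp_field}"
    return pd.read_sql(text(query), engine, params=params)
```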
```
Algorithm: Database Extraction with Integrity Verification
INPUT: table_name, key_field
PROCESS:
1. BUILD query with all table fields
2. ADD checksum calculation for row integrity
3. ADD ORDER BY key_field for consistent results
4. EXECUTE query against database connection
5. CALCULATE row_checksum for each record
6. RETURN dataset with original_data and checksum_verification
OUTPUT: records_with_integrity_checksums
```
### Step 3: Data Relationship Handling
**Hierarchical Data Extraction**
```
Algorithm: Account Hierarchy Extraction
INPUT: salesforce_connection
PROCESS:
1. DEFINE field_list = [Id, Name, ParentId, Type, Industry, CreatedDate, LastModifiedDate]
2. BUILD query to extract all non-deleted accounts
3. ORDER results by ParentId (nulls first), then by Name
4. EXECUTE query and retrieve all account records
5. INITIALIZE hierarchy structure:
- root_accounts = empty list
- child_accounts = empty dictionary
- orphaned_accounts = empty list
6. FOR each account in extracted records:
a. IF account.ParentId is null THEN
ADD account to root_accounts
b. ELSE
IF parent_id not in child_accounts THEN
CREATE new list for parent_id
ADD account to child_accounts[parent_id]
7. RETURN structured hierarchy
OUTPUT: hierarchical_account_structure
```
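The grouping step translates directly to Python; this sketch assumes `accounts` is the list of extracted Account dictionaries:

```python
from collections import defaultdict

def build_account_hierarchy(accounts: list) -> dict:
    """Split accounts into roots, a parent-id index of children, and orphans."""
    root_accounts = []
    child_accounts = defaultdict(list)

    for account in accounts:
        parent_id = account.get("ParentId")
        if parent_id is None:
            root_accounts.append(account)
        else:
            child_accounts[parent_id].append(account)

    # Orphans: children whose parent was not included in the extract
    known_ids = {a["Id"] for a in accounts}
    orphaned = [a for a in accounts
                if a.get("ParentId") and a["ParentId"] not in known_ids]
    return {"roots": root_accounts, "children": dict(child_accounts), "orphaned": orphaned}
```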
**Related Object Extraction**
```
Algorithm: Opportunity Ecosystem Extraction
INPUT: account_ids_list
PROCESS:
1. INITIALIZE opportunities = empty collection
2. FOR each account_id in account_ids_list:
a. DEFINE opportunity_fields = [Id, Name, StageName, Amount, CloseDate, AccountId]
b. DEFINE related_line_items = [Id, PricebookEntry.Product2.Name, Quantity,
UnitPrice, TotalPrice]
c. DEFINE related_tasks = [Id, Subject, ActivityDate, Status, WhoId
WHERE IsClosed = FALSE]
d. DEFINE related_events = [Id, Subject, ActivityDateTime, WhoId
WHERE ActivityDateTime >= TODAY]
e. BUILD query including opportunity_fields and all related objects
f. ADD filter: AccountId = current account_id AND IsDeleted = FALSE
g. EXECUTE query for current account
h. APPEND results to opportunities collection
3. RETURN complete opportunities dataset
OUTPUT: opportunities_with_related_objects
```
## Advanced Extraction Techniques
### Real-time Data Extraction
```
Algorithm: Streaming API Setup for Real-time Updates
INPUT: object_name, fields_list
PROCESS:
1. CREATE push_topic_configuration:
- Name = object_name + "_Updates"
- Query = "SELECT " + join(fields_list) + " FROM " + object_name
- ApiVersion = current_api_version
- NotifyForOperationCreate = true
- NotifyForOperationUpdate = true
- NotifyForOperationDelete = true
- NotifyForFields = "All"
2. SUBMIT push_topic_configuration to Salesforce
3. RECEIVE push_topic_id from creation response
4. RETURN push_topic_id for subscription
OUTPUT: streaming_topic_identifier
```
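With `simple_salesforce`, the PushTopic record can be created through the generic sObject interface, roughly as sketched below. Note that PushTopics belong to the legacy Streaming API (Change Data Capture is the newer alternative), the SOQL must include `Id`, and the exact field types accepted may vary by API version.

```python
def create_push_topic(sf_connection, object_name: str, fields: list,
                      api_version: float = 58.0) -> str:
    """Create a PushTopic that fires on create/update/delete of the given object."""
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"  # fields must include Id
    result = sf_connection.PushTopic.create({
        "Name": f"{object_name}_Updates",  # PushTopic names are limited to 25 characters
        "Query": soql,
        "ApiVersion": api_version,
        "NotifyForOperationCreate": True,
        "NotifyForOperationUpdate": True,
        "NotifyForOperationDelete": True,
        "NotifyForFields": "All",
    })
    return result["id"]
```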
```
Algorithm: Real-time Change Listener
INPUT: push_topic_name, callback_function
PROCESS:
1. INITIALIZE streaming_client with:
- session_id from salesforce_connection
- instance_url from salesforce_connection
2. SUBSCRIBE to streaming topic: "/topic/" + push_topic_name
3. REGISTER callback_function for change notifications
4. START streaming client listener
5. CONTINUOUSLY process incoming change events
6. FOR each change event:
EXECUTE callback_function with event_data
OUTPUT: continuous_real_time_monitoring
```
### Large Volume Data Handling
```
Algorithm: Large Dataset Extraction with Pagination
INPUT: object_name, query, chunk_size (default: 50000)
PROCESS:
1. INITIALIZE all_records = empty collection
2. MODIFY query to include LIMIT chunk_size
3. EXECUTE initial query
4. ADD initial results to all_records
5. WHILE query result indicates more records available:
a. GET next_records_url from previous result
b. EXECUTE query_more with next_records_url
c. ADD new results to all_records
d. IF all_records count is multiple of (chunk_size * 5) THEN
save intermediate results for recovery
6. RETURN complete all_records collection
OUTPUT: complete_large_dataset
```
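In `simple_salesforce` this pagination loop maps onto `query` / `query_more`, roughly as follows; intermediate checkpointing (see Error Handling and Recovery) would slot into the loop body:

```python
def extract_with_pagination(sf_connection, soql: str) -> list:
    """Page through a large SOQL result set using nextRecordsUrl."""
    result = sf_connection.query(soql)
    all_records = list(result["records"])

    while not result["done"]:
        result = sf_connection.query_more(result["nextRecordsUrl"], identifier_is_url=True)
        all_records.extend(result["records"])
        # Optionally persist intermediate results here for crash recovery
    return all_records
```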
```
Algorithm: Parallel Multi-Query Extraction
INPUT: queries_dictionary, max_workers (default: 5)
PROCESS:
1. INITIALIZE results = empty dictionary
2. CREATE thread_pool with max_workers threads
3. FOR each query_name and query in queries_dictionary:
SUBMIT query execution to thread_pool
4. COLLECT completed futures as they finish:
a. GET query_name for completed future
b. TRY to get future result
c. IF successful THEN
SET results[query_name] = query_result
d. ELSE
SET results[query_name] = error_information
5. RETURN results dictionary with all query outcomes
OUTPUT: parallel_query_results
```
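A compact sketch with `concurrent.futures`, assuming each dictionary entry is a plain SOQL string run against a shared `sf_connection` (read-only queries are generally safe to issue concurrently, but org API limits still apply):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_parallel(sf_connection, queries: dict, max_workers: int = 5) -> dict:
    """Run several SOQL queries concurrently, collecting results or errors per query."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sf_connection.query_all, soql): name
                   for name, soql in queries.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()["records"]
            except Exception as exc:  # record the failure but keep processing other queries
                results[name] = {"error": str(exc)}
    return results
```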
## Data Quality and Validation
### Extraction Validation Framework
```
Algorithm: Record Count Validation
INPUT: source_count, extracted_count, tolerance (default: 0.01)
PROCESS:
1. CALCULATE variance = |source_count - extracted_count| / source_count
2. CALCULATE variance_percentage = variance * 100
3. DETERMINE within_tolerance = (variance <= tolerance)
4. IF within_tolerance THEN
SET status = "PASS"
5. ELSE
SET status = "FAIL"
6. CREATE validation_result with all metrics
7. RETURN validation_result
OUTPUT: count_validation_report
```
```
Algorithm: Data Integrity Validation
INPUT: extracted_data, key_field
PROCESS:
1. CONVERT extracted_data to structured format
2. COUNT total_records in dataset
3. COUNT unique_keys in key_field
4. CALCULATE duplicate_count = total_records - unique_keys
5. COUNT null_keys in key_field
6. VERIFY data_types_consistent across records
7. COMPILE integrity_metrics
8. RETURN integrity_validation_report
OUTPUT: data_integrity_assessment
```
```
Algorithm: Referential Integrity Validation
INPUT: parent_data, child_data, parent_key, foreign_key
PROCESS:
1. EXTRACT parent_ids from parent_data using parent_key
2. EXTRACT foreign_keys from child_data using foreign_key (exclude nulls)
3. IDENTIFY orphaned_records = foreign_keys NOT IN parent_ids
4. COUNT parent_record_count, child_record_count, orphaned_child_count
5. IF child_record_count > 0 THEN
CALCULATE integrity_percentage = (1 - orphaned_child_count / child_record_count) * 100
6. ELSE
SET integrity_percentage = 100
7. COMPILE referential_integrity_report
8. RETURN validation_results
OUTPUT: referential_integrity_assessment
```
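A set-based sketch of the referential check, assuming both datasets are lists of dictionaries; the default key names are placeholders that would differ per object pair:

```python
def validate_referential_integrity(parent_data: list, child_data: list,
                                   parent_key: str = "Id",
                                   foreign_key: str = "AccountId") -> dict:
    """Report child records whose foreign key has no matching parent record."""
    parent_ids = {p[parent_key] for p in parent_data}
    foreign_keys = [c[foreign_key] for c in child_data if c.get(foreign_key)]
    orphaned = [fk for fk in foreign_keys if fk not in parent_ids]

    child_count = len(child_data)
    integrity_pct = 100.0 if child_count == 0 else (1 - len(orphaned) / child_count) * 100
    return {
        "parent_records": len(parent_data),
        "child_records": child_count,
        "orphaned_children": len(orphaned),
        "integrity_percentage": round(integrity_pct, 2),
    }
```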
## Error Handling and Recovery
### Robust Extraction Pipeline
```
Algorithm: Extraction with Retry Logic
INPUT: extraction_function, function_parameters, retry_attempts (default: 3), backoff_factor (default: 2)
PROCESS:
1. FOR attempt = 1 to retry_attempts:
a. TRY to execute extraction_function with function_parameters
b. IF successful THEN
RETURN extraction_result
c. IF exception occurs THEN
IF attempt = retry_attempts THEN
RAISE final exception
ELSE
CALCULATE wait_time = backoff_factor ^ (attempt - 1)
LOG failure message and retry information
WAIT for wait_time seconds
2. IF all attempts fail THEN
RAISE extraction_failure_exception
OUTPUT: successful_extraction_result OR exception
```
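A minimal retry wrapper implementing the same exponential backoff scheme:

```python
import logging
import time

def extract_with_retry(extraction_function, *args, retry_attempts: int = 3,
                       backoff_factor: float = 2.0, **kwargs):
    """Retry a flaky extraction call with exponential backoff between attempts."""
    for attempt in range(1, retry_attempts + 1):
        try:
            return extraction_function(*args, **kwargs)
        except Exception as exc:
            if attempt == retry_attempts:
                raise  # out of attempts: surface the final error to the caller
            wait_time = backoff_factor ** (attempt - 1)
            logging.warning("Attempt %d/%d failed (%s); retrying in %.1fs",
                            attempt, retry_attempts, exc, wait_time)
            time.sleep(wait_time)
```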
```
Algorithm: Extraction Progress Checkpointing
INPUT: extracted_data, checkpoint_id
PROCESS:
1. CREATE checkpoint_metadata:
- checkpoint_id = provided identifier
- timestamp = current datetime
- record_count = count of extracted_data
- data = extracted_data
2. SERIALIZE checkpoint_metadata to storage format
3. SAVE to file: "checkpoint_" + checkpoint_id + ".json"
4. CONFIRM successful save operation
OUTPUT: checkpoint_saved_confirmation
```
```
Algorithm: Resume from Checkpoint
INPUT: checkpoint_id
PROCESS:
1. CONSTRUCT checkpoint_filename = "checkpoint_" + checkpoint_id + ".json"
2. TRY to read checkpoint_filename
3. IF file exists THEN
a. DESERIALIZE checkpoint_data from file
b. RETURN checkpoint_data['data']
4. ELSE
RETURN null (no checkpoint found)
OUTPUT: recovered_data OR null
```
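Both checkpoint operations can be sketched with plain JSON files; this assumes the record payloads are JSON-serializable, and very large datasets would need a sturdier store:

```python
import json
import os
from datetime import datetime, timezone

def save_checkpoint(extracted_data: list, checkpoint_id: str) -> str:
    """Persist extraction progress so a failed run can be resumed later."""
    path = f"checkpoint_{checkpoint_id}.json"
    payload = {
        "checkpoint_id": checkpoint_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_count": len(extracted_data),
        "data": extracted_data,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path

def resume_from_checkpoint(checkpoint_id: str):
    """Return previously checkpointed records, or None if no checkpoint exists."""
    path = f"checkpoint_{checkpoint_id}.json"
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["data"]
```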
## Performance Monitoring and Optimization
### Extraction Performance Metrics
```
Algorithm: Extraction Performance Tracking
INPUT: object_name, start_time, end_time, record_count, data_size_mb
PROCESS:
1. CALCULATE duration_seconds = end_time - start_time
2. IF duration_seconds > 0 THEN
a. CALCULATE records_per_second = record_count / duration_seconds
b. CALCULATE mb_per_second = data_size_mb / duration_seconds
3. ELSE
SET records_per_second = 0, mb_per_second = 0
4. CREATE performance_metrics:
- object_name, duration_seconds, records_extracted
- data_size_mb, records_per_second, mb_per_second
- extraction_timestamp = start_time
5. STORE metrics for object_name
6. RETURN performance_metrics
OUTPUT: extraction_performance_data
```
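A small helper capturing the same metrics; start and end times are assumed to be `time.time()` floats supplied by the caller:

```python
def track_extraction_performance(object_name: str, start_time: float, end_time: float,
                                 record_count: int, data_size_mb: float) -> dict:
    """Compute throughput metrics for a single extraction run."""
    duration = end_time - start_time
    return {
        "object_name": object_name,
        "duration_seconds": round(duration, 2),
        "records_extracted": record_count,
        "data_size_mb": data_size_mb,
        "records_per_second": record_count / duration if duration > 0 else 0,
        "mb_per_second": data_size_mb / duration if duration > 0 else 0,
        "extraction_timestamp": start_time,
    }
```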
```
Algorithm: Performance Report Generation
INPUT: collected_metrics_for_all_objects
PROCESS:
1. CALCULATE summary_statistics:
- total_objects = count of metrics
- total_records = sum of records_extracted across all objects
- total_data_size_mb = sum of data_size_mb across all objects
- average_records_per_second = mean of records_per_second values
2. COMPILE object_details from individual metrics
3. GENERATE optimization_recommendations based on performance patterns
4. CREATE comprehensive_report with:
- summary_statistics
- object_details
- optimization_recommendations
5. RETURN performance_report
OUTPUT: comprehensive_performance_analysis
```
## Tools and Integration
### Supported Tools and Platforms
```yaml
ETL_Tools:
  Talend:
    - Salesforce connectors
    - Built-in data quality
    - Visual job design
  Informatica:
    - PowerCenter
    - Cloud Data Integration
    - Real-time processing
  MuleSoft:
    - Anypoint Platform
    - API-led connectivity
    - Real-time synchronization
  Custom_Solutions:
    - Python with pandas
    - Apache Airflow
    - AWS Glue
    - Azure Data Factory
Monitoring_Tools:
  - Salesforce Event Monitoring
  - Custom logging frameworks
  - APM solutions (New Relic, Datadog)
  - Database performance monitors
```
## Success Criteria
- ✅ Source systems analyzed and profiled
- ✅ Extraction strategy designed and documented
- ✅ Performance optimization implemented
- ✅ Data validation framework established
- ✅ Error handling and recovery mechanisms active
- ✅ Monitoring and alerting configured
- ✅ Documentation and runbooks completed
- ✅ Stakeholder sign-off obtained