# Salesforce ETL Best Practices
## Overview
This guide provides comprehensive best practices for Extract, Transform, and
Load (ETL) operations with Salesforce. Following these practices ensures data
integrity, optimal performance, and maintainable integration solutions.
## Planning and Design
### Requirements Analysis
1. **Data Volume Assessment**
   - Current data volume and growth projections
   - Peak load times and patterns
   - Historical data requirements
   - Archive and retention policies
2. **Performance Requirements**
   - Acceptable latency (real-time vs batch)
   - Processing windows
   - Throughput requirements
   - System availability needs
3. **Data Mapping Documentation**
```yaml
mapping_template:
  source_object: CustomerMaster
  target_object: Account
  fields:
    - source: CUST_ID
      target: External_Customer_ID__c
      transformation: none
      required: true
    - source: CUST_NAME
      target: Name
      transformation: proper_case
      required: true
    - source: REVENUE
      target: AnnualRevenue
      transformation: currency_conversion
      required: false
```
### Architecture Principles
1. **Loose Coupling**
   - Use integration middleware
   - Implement service abstraction
   - Avoid point-to-point integrations
   - Enable configuration-driven changes
2. **Scalability**
   - Design for horizontal scaling
   - Implement load balancing
   - Use connection pooling
   - Plan for data growth
3. **Fault Tolerance** (see the retry sketch after this list)
   - Build retry mechanisms
   - Implement circuit breakers
   - Design for partial failures
   - Maintain transaction integrity
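To make the fault-tolerance principles concrete, here is a minimal retry-with-backoff sketch in Python. It assumes transient Salesforce errors surface as exceptions whose text contains the listed status codes; `with_retries` and the wrapped callable are illustrative, not a specific library API.
```python
import random
import time

RETRYABLE_ERRORS = ('UNABLE_TO_LOCK_ROW', 'REQUEST_LIMIT_EXCEEDED', 'SERVER_UNAVAILABLE')

def with_retries(operation, max_attempts=3, base_delay=2):
    """Retry a callable on transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = any(code in str(exc) for code in RETRYABLE_ERRORS)
            if not transient or attempt == max_attempts:
                raise  # permanent failure, or retries exhausted
            # Backoff: 2s, 4s, 8s... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1))
```
A call such as `with_retries(lambda: sf.query(soql))` keeps the retry policy out of the business logic; a circuit breaker can wrap the same callable and stop calling the endpoint after repeated consecutive failures.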
## Data Extraction Best Practices
### API Selection Guide
| Use Case | Recommended API | Volume | Frequency |
| ---------------- | ------------------- | ------------ | ------------ |
| Large data loads | Bulk API 2.0 | >10K records | Daily/Weekly |
| Real-time sync | REST API | <10K records | Continuous |
| Complex queries | SOAP API | Medium | As needed |
| Change tracking | CDC/Platform Events | Any | Real-time |
| Reporting data | Analytics API | Aggregated | Scheduled |
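As a rough illustration, the decision logic in the table can be captured in a small helper; the function name and thresholds below simply restate the table and are not an official rule set.
```python
def recommend_api(record_count, realtime=False, change_tracking=False):
    """Suggest a Salesforce API per the selection table above (illustrative)."""
    if change_tracking:
        return 'Change Data Capture / Platform Events'
    if realtime:
        return 'REST API'      # continuous sync of small volumes
    if record_count > 10_000:
        return 'Bulk API 2.0'  # large daily/weekly loads
    return 'REST API'
```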
### Query Optimization
1. **Selective Queries**
```sql
-- Good: Selective with indexed fields
SELECT Id, Name, LastModifiedDate
FROM Account
WHERE LastModifiedDate >= YESTERDAY
  AND RecordType.DeveloperName = 'Customer'

-- Bad: Non-selective (a leading wildcard cannot use an index)
SELECT Id, Name
FROM Account
WHERE Name LIKE '%Corp%'
```
2. **Relationship Queries**
```sql
-- Efficient: Single query with relationships
SELECT Id, Name,
       (SELECT Id, Email FROM Contacts),
       Owner.Name
FROM Account
WHERE Type = 'Customer'

-- Inefficient: Multiple queries
-- (first query Accounts, then query Contacts separately)
```
3. **Field Selection** (a short query-builder sketch follows this list)
   - Query only required fields
   - Avoid `SELECT *`-style patterns
   - Use field sets for consistency
   - Consider formula field impact
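As noted above, a small query builder (a sketch; the field list and helper name are illustrative) makes it easy to keep queries limited to an explicit, reviewed set of fields:
```python
# Illustrative: build narrow SOQL queries from an explicit field list
ACCOUNT_SYNC_FIELDS = ['Id', 'Name', 'AnnualRevenue', 'LastModifiedDate']

def build_query(object_name, fields, where_clause=None):
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"
    if where_clause:
        soql += f" WHERE {where_clause}"
    return soql

# Example: build_query('Account', ACCOUNT_SYNC_FIELDS, "LastModifiedDate >= YESTERDAY")
```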
### Incremental Loading
```python
# Python example for incremental extraction
def extract_incremental_data(last_run_date):
    query = f"""
        SELECT Id, Name, LastModifiedDate
        FROM Account
        WHERE LastModifiedDate > {last_run_date}
           OR CreatedDate > {last_run_date}
        ORDER BY LastModifiedDate
    """
    # Use queryMore for large result sets
    results = []
    done = False
    query_locator = None
    while not done:
        if query_locator:
            response = sf.queryMore(query_locator)
        else:
            response = sf.query(query)
        results.extend(response['records'])
        done = response['done']
        query_locator = response.get('nextRecordsUrl')
    return results
```
## Transformation Best Practices
### Data Quality Rules
1. **Validation Framework**
```python
import re

class DataValidator:
    def __init__(self):
        self.errors = []
        self.warnings = []

    def validate_email(self, email):
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        if not re.match(pattern, email):
            self.errors.append(f"Invalid email: {email}")
            return None
        return email.lower()

    def validate_phone(self, phone):
        # Remove non-numeric characters
        cleaned = re.sub(r'[^0-9]', '', phone)
        if len(cleaned) not in (10, 11):
            self.warnings.append(f"Unusual phone length: {phone}")
        return self.format_phone(cleaned)

    def format_phone(self, digits):
        # Normalize 10-digit numbers to (XXX) XXX-XXXX; pass others through
        if len(digits) == 10:
            return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        return digits
```
2. **Standardization Rules** (a minimal sketch follows this list)
   - Consistent date formats (ISO 8601)
   - Normalized phone numbers
   - Standardized addresses
   - Unified picklist values
   - Consistent naming conventions
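A minimal sketch of how these rules might be applied during transformation; the source field names, date format, and picklist mapping are assumptions for illustration:
```python
from datetime import datetime

def standardize_record(record):
    """Apply common standardization rules to a source record (illustrative)."""
    out = dict(record)
    # Consistent dates: assume the source uses MM/DD/YYYY, emit ISO 8601
    if out.get('created_on'):
        out['created_on'] = datetime.strptime(out['created_on'], '%m/%d/%Y').date().isoformat()
    # Normalized phone numbers: keep digits only
    if out.get('phone'):
        out['phone'] = ''.join(ch for ch in out['phone'] if ch.isdigit())
    # Consistent naming: proper case for names
    if out.get('name'):
        out['name'] = out['name'].strip().title()
    # Unified picklist values: map source codes to Salesforce picklist entries
    status_map = {'A': 'Active', 'I': 'Inactive'}
    if out.get('status') in status_map:
        out['status'] = status_map[out['status']]
    return out
```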
### Transformation Patterns
1. **Lookup Transformations**
```python
# Build lookup cache for performance
def build_lookup_cache(object_name, key_field, value_field):
    cache = {}
    query = f"SELECT {key_field}, {value_field} FROM {object_name}"
    for record in sf.query_all(query)['records']:
        cache[record[key_field]] = record[value_field]
    return cache

# Use cache for transformations
account_cache = build_lookup_cache('Account', 'External_ID__c', 'Id')

def transform_contact(source_contact):
    return {
        'FirstName': source_contact['first_name'],
        'LastName': source_contact['last_name'],
        'AccountId': account_cache.get(source_contact['company_id'])
    }
```
2. **Complex Business Logic**
```python
def calculate_customer_tier(customer_data):
    revenue = customer_data.get('annual_revenue', 0)
    employee_count = customer_data.get('employees', 0)
    years_customer = customer_data.get('years_active', 0)

    score = 0
    score += min(revenue / 100000, 50)      # Up to 50 points
    score += min(employee_count / 100, 30)  # Up to 30 points
    score += min(years_customer * 4, 20)    # Up to 20 points

    if score >= 80:
        return 'Platinum'
    elif score >= 60:
        return 'Gold'
    elif score >= 40:
        return 'Silver'
    else:
        return 'Bronze'
```
### Error Handling in Transformations
```python
from datetime import datetime

def safe_transform(record, transformation_func):
    try:
        return transformation_func(record), None
    except Exception as e:
        error = {
            'record_id': record.get('id'),
            'error': str(e),
            'timestamp': datetime.now().isoformat()
        }
        return None, error

# Batch processing with error collection
def transform_batch(records, transformation_func):
    transformed = []
    errors = []
    for record in records:
        result, error = safe_transform(record, transformation_func)
        if error:
            errors.append(error)
        elif result:
            transformed.append(result)
    return transformed, errors
```
## Loading Best Practices
### Bulk API Usage
1. **Optimal Batch Sizing**
```python
def create_batches(records, batch_size=10000):
    """Create optimally sized batches for the Bulk API"""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def load_with_bulk_api(records, object_name):
    job = bulk_api.create_job(object_name, 'insert')
    for batch in create_batches(records):
        bulk_api.post_batch(job, batch)
    bulk_api.close_job(job)
    return monitor_job(job)
```
2. **Parallel Processing**
```python
from concurrent.futures import ThreadPoolExecutor
import threading

class BulkLoader:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.api_calls = 0
        self.api_limit = 5000  # Daily limit
        self.lock = threading.Lock()

    def check_api_limit(self):
        with self.lock:
            if self.api_calls >= self.api_limit:
                raise Exception("API limit reached")
            self.api_calls += 1

    def load_parallel(self, data_chunks, object_name):
        # load_chunk (not shown) performs the actual API call for one chunk
        futures = []
        for chunk in data_chunks:
            self.check_api_limit()
            future = self.executor.submit(
                self.load_chunk, chunk, object_name
            )
            futures.append(future)
        return [f.result() for f in futures]
```
### Upsert Strategies
1. **External ID Management**
```python
def prepare_upsert_data(records, external_id_field):
    # Ensure external IDs are unique and non-null
    seen_ids = set()
    clean_records = []
    for record in records:
        ext_id = record.get(external_id_field)
        if not ext_id:
            log_error(f"Missing external ID for record: {record}")
            continue
        if ext_id in seen_ids:
            log_error(f"Duplicate external ID: {ext_id}")
            continue
        seen_ids.add(ext_id)
        clean_records.append(record)
    return clean_records
```
2. **Relationship Loading**
```json
// Using external IDs for relationships
{
  "Name": "John Doe",
  "Email": "john@example.com",
  "Account__r": {
    "External_ID__c": "EXT-12345"
  },
  "Manager__r": {
    "Employee_ID__c": "EMP-67890"
  }
}
```
### Transaction Management
```python
class TransactionalLoader:
    def __init__(self, sf_connection):
        self.sf = sf_connection
        self.transaction_log = []

    def load_with_rollback(self, operations, external_id_field=None):
        """
        Load data with the ability to roll back on failure
        operations: [(object_name, records, operation_type)]
        """
        completed = []
        try:
            for obj_name, records, op_type in operations:
                if op_type == 'insert':
                    results = self.sf.bulk.__getattr__(obj_name).insert(records)
                elif op_type == 'update':
                    results = self.sf.bulk.__getattr__(obj_name).update(records)
                elif op_type == 'upsert':
                    results = self.sf.bulk.__getattr__(obj_name).upsert(
                        records, external_id_field
                    )
                # Check for errors
                errors = [r for r in results if not r['success']]
                if errors:
                    raise Exception(f"Load failed: {errors}")
                completed.append((obj_name, results))
                self.log_transaction(obj_name, op_type, len(records))
            return completed
        except Exception as e:
            # Roll back completed operations
            self.rollback(completed)
            raise e

    def rollback(self, completed_operations):
        """Delete inserted records in reverse order"""
        for obj_name, results in reversed(completed_operations):
            ids_to_delete = [r['id'] for r in results if r['success']]
            if ids_to_delete:
                self.sf.bulk.__getattr__(obj_name).delete(ids_to_delete)
```
## Performance Optimization
### API Limit Management
```python
import logging

logger = logging.getLogger(__name__)

class APILimitManager:
    def __init__(self, sf_connection):
        self.sf = sf_connection
        self.use_bulk = False
        self.check_limits()

    def check_limits(self):
        limits = self.sf.limits()
        self.daily_api_requests = limits['DailyApiRequests']
        used = self.daily_api_requests['Max'] - self.daily_api_requests['Remaining']
        usage_percent = (used / self.daily_api_requests['Max']) * 100
        # Check the critical threshold first; otherwise it can never be reached
        if usage_percent > 95:
            raise Exception("API limit critical")
        elif usage_percent > 80:
            self.switch_to_bulk_api()

    def switch_to_bulk_api(self):
        logger.info("Switching to Bulk API due to limit constraints")
        self.use_bulk = True
```
### Caching Strategies
```python
import time

class DataCache:
    def __init__(self, sf_connection, ttl_seconds=3600):
        self.sf = sf_connection
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_fetch(self, key, fetch_function):
        if key in self.cache:
            value, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return value
        value = fetch_function()
        self.cache[key] = (value, time.time())
        return value

    def lookup_record_type(self, object_name, developer_name):
        """Cached record type lookup"""
        def fetch():
            query = f"""
                SELECT Id FROM RecordType
                WHERE SObjectType = '{object_name}'
                AND DeveloperName = '{developer_name}'
            """
            result = self.sf.query(query)
            return result['records'][0]['Id'] if result['records'] else None
        return self.get_or_fetch(f"rt:{object_name}:{developer_name}", fetch)
```
### Monitoring and Metrics
```python
class ETLMetricsCollector:
    def __init__(self):
        self.metrics = {
            'start_time': None,
            'end_time': None,
            'records_processed': 0,
            'errors': 0,
            'api_calls': 0,
            'transformations': {}
        }

    def record_transformation(self, transform_name, duration, record_count):
        if transform_name not in self.metrics['transformations']:
            self.metrics['transformations'][transform_name] = {
                'total_duration': 0,
                'total_records': 0,
                'invocations': 0
            }
        stats = self.metrics['transformations'][transform_name]
        stats['total_duration'] += duration
        stats['total_records'] += record_count
        stats['invocations'] += 1

    def generate_report(self):
        duration = self.metrics['end_time'] - self.metrics['start_time']
        report = {
            'summary': {
                'duration_seconds': duration.total_seconds(),
                'records_per_second': self.metrics['records_processed'] / duration.total_seconds(),
                'error_rate': self.metrics['errors'] / self.metrics['records_processed'],
                'api_efficiency': self.metrics['records_processed'] / self.metrics['api_calls']
            },
            'transformations': {}
        }
        for name, stats in self.metrics['transformations'].items():
            report['transformations'][name] = {
                'avg_duration': stats['total_duration'] / stats['invocations'],
                'avg_records': stats['total_records'] / stats['invocations']
            }
        return report
```
## Error Handling and Recovery
### Comprehensive Error Strategy
```python
class ETLErrorHandler:
    def __init__(self, config):
        self.max_retries = config.get('max_retries', 3)
        self.retry_delay = config.get('retry_delay', 60)
        self.error_threshold = config.get('error_threshold', 0.05)
        self.retry_queue = []
        self.dead_letter_queue = []

    def handle_batch_errors(self, batch_results, batch_data):
        errors = []
        successful = []
        for i, result in enumerate(batch_results):
            if result['success']:
                successful.append(result)
            else:
                error_record = {
                    'data': batch_data[i],
                    'error': result['errors'],
                    'attempt': 1
                }
                errors.append(error_record)
        # Check error threshold
        error_rate = len(errors) / len(batch_results)
        if error_rate > self.error_threshold:
            raise Exception(f"Error rate {error_rate} exceeds threshold")
        # Process errors
        self.process_errors(errors)
        return successful

    def process_errors(self, errors):
        for error in errors:
            if self.is_retryable(error['error']):
                self.retry_queue.append(error)
            else:
                self.dead_letter_queue.append(error)
                self.log_permanent_failure(error)

    def is_retryable(self, error):
        retryable_errors = [
            'UNABLE_TO_LOCK_ROW',
            'REQUEST_LIMIT_EXCEEDED',
            'SERVER_UNAVAILABLE'
        ]
        return any(e in str(error) for e in retryable_errors)
```
### Dead Letter Queue Processing
```python
def process_dead_letter_queue(dlq_records):
    """
    Process records that failed permanently
    """
    # Group by error type
    error_groups = {}
    for record in dlq_records:
        error_type = record['error'][0]['statusCode']
        if error_type not in error_groups:
            error_groups[error_type] = []
        error_groups[error_type].append(record)

    # Generate report
    report = {
        'timestamp': datetime.now().isoformat(),
        'total_failures': len(dlq_records),
        'error_summary': {}
    }
    for error_type, records in error_groups.items():
        report['error_summary'][error_type] = {
            'count': len(records),
            'sample_errors': records[:5]  # First 5 examples
        }

    # Save to error log
    save_error_report(report)
    # Notify administrators
    send_error_notification(report)
```
## Security Best Practices
### Credential Management
```python
import keyring
import requests
from cryptography.fernet import Fernet

class SecureCredentialManager:
    def __init__(self):
        self.service_name = "salesforce_etl"

    def store_credential(self, username, password):
        """Securely store credentials"""
        keyring.set_password(self.service_name, username, password)

    def get_credential(self, username):
        """Retrieve credentials securely"""
        return keyring.get_password(self.service_name, username)

    def get_oauth_token(self, config):
        """OAuth 2.0 username-password flow for Salesforce"""
        auth_url = f"{config['instance']}/services/oauth2/token"
        data = {
            'grant_type': 'password',
            'client_id': config['client_id'],
            'client_secret': self.get_credential('client_secret'),
            'username': config['username'],
            'password': self.get_credential('password')
        }
        response = requests.post(auth_url, data=data)
        return response.json()['access_token']
```
### Data Encryption
```python
class DataEncryption:
    def __init__(self, key=None):
        self.key = key or Fernet.generate_key()
        self.cipher = Fernet(self.key)

    def encrypt_sensitive_fields(self, record, sensitive_fields):
        """Encrypt specific fields before transmission"""
        encrypted_record = record.copy()
        for field in sensitive_fields:
            if field in encrypted_record and encrypted_record[field]:
                value = str(encrypted_record[field]).encode()
                encrypted_record[field] = self.cipher.encrypt(value).decode()
        return encrypted_record

    def decrypt_sensitive_fields(self, record, sensitive_fields):
        """Decrypt fields after retrieval"""
        decrypted_record = record.copy()
        for field in sensitive_fields:
            if field in decrypted_record and decrypted_record[field]:
                value = decrypted_record[field].encode()
                decrypted_record[field] = self.cipher.decrypt(value).decode()
        return decrypted_record
```
## Testing and Validation
### ETL Testing Framework
```python
import unittest
from unittest.mock import Mock, patch

class ETLTestFramework(unittest.TestCase):
    def setUp(self):
        self.test_data = [
            {'id': '1', 'name': 'Test Account', 'revenue': 100000},
            {'id': '2', 'name': 'Another Account', 'revenue': 200000}
        ]

    def test_transformation_logic(self):
        """Test data transformation functions"""
        transformer = DataTransformer()
        for record in self.test_data:
            transformed = transformer.transform_account(record)
            # Assert transformations
            self.assertEqual(transformed['Name'], record['name'])
            self.assertEqual(transformed['AnnualRevenue'], record['revenue'])
            self.assertIn('External_ID__c', transformed)

    def test_error_handling(self):
        """Test error handling logic"""
        error_handler = ETLErrorHandler({'max_retries': 3})
        # Test retryable error
        error = {'statusCode': 'UNABLE_TO_LOCK_ROW'}
        self.assertTrue(error_handler.is_retryable(error))
        # Test non-retryable error
        error = {'statusCode': 'INVALID_FIELD'}
        self.assertFalse(error_handler.is_retryable(error))

    @patch('etl_pipeline.bulk_api')  # patch target and load_accounts are illustrative placeholders
    def test_bulk_load(self, mock_bulk):
        """Test bulk loading process"""
        mock_bulk.insert.return_value = [
            {'success': True, 'id': '001XX000003DHP0'},
            {'success': True, 'id': '001XX000003DHP1'}
        ]
        loader = BulkLoader()
        results = loader.load_accounts(self.test_data)
        self.assertEqual(len(results), 2)
        self.assertTrue(all(r['success'] for r in results))
```
### Data Validation Tests
```python
def validate_etl_results(source_data, target_data, mapping_rules):
    """
    Validate ETL transformation results
    """
    validation_results = {
        'passed': 0,
        'failed': 0,
        'errors': []
    }
    for i, source_record in enumerate(source_data):
        target_record = target_data[i]
        for rule in mapping_rules:
            source_value = source_record.get(rule['source'])
            target_value = target_record.get(rule['target'])
            # Apply transformation for comparison
            expected_value = apply_transformation(
                source_value,
                rule.get('transformation')
            )
            if target_value != expected_value:
                validation_results['failed'] += 1
                validation_results['errors'].append({
                    'record_id': source_record.get('id'),
                    'field': rule['target'],
                    'expected': expected_value,
                    'actual': target_value
                })
            else:
                validation_results['passed'] += 1
    return validation_results
```
## Documentation Standards
### ETL Process Documentation
```yaml
# ETL Process Documentation Template
etl_process:
  name: Customer Data Sync
  version: 1.0
  last_updated: 2024-01-15
  owner: Data Integration Team

  overview: |
    Synchronizes customer data from ERP to Salesforce.
    Runs daily at 2 AM EST.

  source_systems:
    - name: SAP ERP
      connection: sap_prod
      objects:
        - KNA1 (Customer Master)
        - KNVV (Sales Area Data)

  target_system:
    name: Salesforce Production
    connection: sf_prod
    objects:
      - Account
      - Contact

  transformations:
    - name: Customer to Account
      source: KNA1
      target: Account
      rules:
        - Proper case for names
        - Phone number formatting
        - Address standardization

  error_handling:
    retry_attempts: 3
    retry_delay: 300
    error_notification: data-team@company.com

  monitoring:
    success_metric: 99% records processed
    performance_metric: < 2 hours completion
    alerts:
      - Error rate > 5%
      - Duration > 3 hours
      - API limit > 80%
```
## Maintenance and Operations
### Regular Maintenance Tasks
1. **Daily Tasks** (a sample health-check sketch follows this list)
   - Monitor job execution logs
   - Check error rates
   - Verify data quality metrics
   - Review API usage
2. **Weekly Tasks**
   - Analyze performance trends
   - Review and process the DLQ
   - Update transformation rules
   - Optimize slow queries
3. **Monthly Tasks**
   - Full data validation
   - Performance optimization
   - Security audit
   - Documentation updates
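Parts of the daily checklist can be automated; the sketch below checks run metrics against the alert thresholds from the documentation template above (error rate > 5%, duration > 3 hours, API limit > 80%). The shapes of the `metrics` and `limits` dictionaries are assumptions.
```python
def daily_health_check(metrics, limits):
    """Flag ETL runs that breach the documented alert thresholds (illustrative)."""
    alerts = []
    error_rate = metrics['errors'] / max(metrics['records_processed'], 1)
    if error_rate > 0.05:
        alerts.append(f"Error rate {error_rate:.1%} exceeds the 5% threshold")
    if metrics['duration_hours'] > 3:
        alerts.append(f"Run took {metrics['duration_hours']:.1f}h (limit: 3h)")
    api_usage = (limits['Max'] - limits['Remaining']) / limits['Max']
    if api_usage > 0.80:
        alerts.append(f"Daily API usage at {api_usage:.0%} (threshold: 80%)")
    return alerts  # route to the team's notification channel
```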
### Operational Runbook
```markdown
# ETL Operations Runbook
## Job Failure Response
1. Check job logs for error details
2. Verify source system availability
3. Check Salesforce system status
4. Review API limits
5. Restart job if transient issue
6. Escalate if persistent failure
## Performance Degradation
1. Check data volume changes
2. Review query performance
3. Analyze transformation bottlenecks
4. Verify network connectivity
5. Scale resources if needed
## Data Quality Issues
1. Run validation reports
2. Identify affected records
3. Determine root cause
4. Fix at source if possible
5. Run correction procedures
6. Update validation rules
```
## Additional Resources
- [Salesforce Bulk API Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/)
- [Integration Patterns and Practices](https://developer.salesforce.com/docs/atlas.en-us.integration_patterns_and_practices.meta/)
- [Large Data Volume Best Practices](https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/)
- [Big Objects Implementation Guide](https://developer.salesforce.com/docs/atlas.en-us.bigobjects.meta/)