# Salesforce ETL Best Practices
## Overview
This guide provides comprehensive best practices for Extract, Transform, and
Load (ETL) operations with Salesforce. Following these practices ensures data
integrity, optimal performance, and maintainable integration solutions.
## Planning and Design
### Requirements Analysis
1. **Data Volume Assessment**
   - Current data volume and growth projections
   - Peak load times and patterns
   - Historical data requirements
   - Archive and retention policies
2. **Performance Requirements**
   - Acceptable latency (real-time vs batch)
   - Processing windows
   - Throughput requirements
   - System availability needs
3. **Data Mapping Documentation**
```yaml
mapping_template:
  source_object: CustomerMaster
  target_object: Account
  fields:
    - source: CUST_ID
      target: External_Customer_ID__c
      transformation: none
      required: true
    - source: CUST_NAME
      target: Name
      transformation: proper_case
      required: true
    - source: REVENUE
      target: AnnualRevenue
      transformation: currency_conversion
      required: false
```
### Architecture Principles
1. **Loose Coupling**
   - Use integration middleware
   - Implement service abstraction
   - Avoid point-to-point integrations
   - Enable configuration-driven changes
2. **Scalability**
   - Design for horizontal scaling
   - Implement load balancing
   - Use connection pooling
   - Plan for data growth
3. **Fault Tolerance** (see the retry sketch after this list)
   - Build retry mechanisms
   - Implement circuit breakers
   - Design for partial failures
   - Maintain transaction integrity
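To make the fault-tolerance principles concrete, here is a minimal retry-with-backoff sketch in Python. It assumes transient Salesforce errors surface as exceptions whose text contains the listed status codes; `with_retries` and the wrapped callable are illustrative, not a specific library API.
```python
import random
import time

RETRYABLE_ERRORS = ('UNABLE_TO_LOCK_ROW', 'REQUEST_LIMIT_EXCEEDED', 'SERVER_UNAVAILABLE')

def with_retries(operation, max_attempts=3, base_delay=2):
    """Retry a callable on transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = any(code in str(exc) for code in RETRYABLE_ERRORS)
            if not transient or attempt == max_attempts:
                raise  # permanent failure, or retries exhausted
            # Backoff: 2s, 4s, 8s... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1))
```
A call such as `with_retries(lambda: sf.query(soql))` keeps the retry policy out of the business logic; a circuit breaker can wrap the same callable and stop calling the endpoint after repeated consecutive failures.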
## Data Extraction Best Practices
### API Selection Guide
| Use Case | Recommended API | Volume | Frequency |
| ---------------- | ------------------- | ------------ | ------------ |
| Large data loads | Bulk API 2.0 | >10K records | Daily/Weekly |
| Real-time sync | REST API | <10K records | Continuous |
| Complex queries | SOAP API | Medium | As needed |
| Change tracking | CDC/Platform Events | Any | Real-time |
| Reporting data | Analytics API | Aggregated | Scheduled |
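As a rough illustration, the decision logic in the table can be captured in a small helper; the function name and thresholds below simply restate the table and are not an official rule set.
```python
def recommend_api(record_count, realtime=False, change_tracking=False):
    """Suggest a Salesforce API per the selection table above (illustrative)."""
    if change_tracking:
        return 'Change Data Capture / Platform Events'
    if realtime:
        return 'REST API'      # continuous sync of small volumes
    if record_count > 10_000:
        return 'Bulk API 2.0'  # large daily/weekly loads
    return 'REST API'
```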
### Query Optimization
1. **Selective Queries**
```sql
-- Good: Selective with indexed fields
SELECT Id, Name, LastModifiedDate
FROM Account
WHERE LastModifiedDate >= YESTERDAY
  AND RecordType.DeveloperName = 'Customer'

-- Bad: Non-selective (a leading wildcard cannot use an index)
SELECT Id, Name
FROM Account
WHERE Name LIKE '%Corp%'
```
2. **Relationship Queries**
```sql
-- Efficient: Single query with relationships
SELECT Id, Name,
       (SELECT Id, Email FROM Contacts),
       Owner.Name
FROM Account
WHERE Type = 'Customer'

-- Inefficient: Multiple queries
-- (first query Accounts, then query Contacts separately)
```
3. **Field Selection** (a short query-builder sketch follows this list)
   - Query only required fields
   - Avoid `SELECT *`-style patterns
   - Use field sets for consistency
   - Consider formula field impact
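As noted above, a small query builder (a sketch; the field list and helper name are illustrative) makes it easy to keep queries limited to an explicit, reviewed set of fields:
```python
# Illustrative: build narrow SOQL queries from an explicit field list
ACCOUNT_SYNC_FIELDS = ['Id', 'Name', 'AnnualRevenue', 'LastModifiedDate']

def build_query(object_name, fields, where_clause=None):
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"
    if where_clause:
        soql += f" WHERE {where_clause}"
    return soql

# Example: build_query('Account', ACCOUNT_SYNC_FIELDS, "LastModifiedDate >= YESTERDAY")
```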
### Incremental Loading
```python
# Python example for incremental extraction
def extract_incremental_data(last_run_date):
    query = f"""
        SELECT Id, Name, LastModifiedDate
        FROM Account
        WHERE LastModifiedDate > {last_run_date}
           OR CreatedDate > {last_run_date}
        ORDER BY LastModifiedDate
    """
    # Use queryMore for large result sets
    results = []
    done = False
    query_locator = None
    while not done:
        if query_locator:
            response = sf.queryMore(query_locator)
        else:
            response = sf.query(query)
        results.extend(response['records'])
        done = response['done']
        query_locator = response.get('nextRecordsUrl')
    return results
```
## Transformation Best Practices
### Data Quality Rules
1. **Validation Framework**
```python
import re

class DataValidator:
    def __init__(self):
        self.errors = []
        self.warnings = []

    def validate_email(self, email):
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        if not re.match(pattern, email):
            self.errors.append(f"Invalid email: {email}")
            return None
        return email.lower()

    def validate_phone(self, phone):
        # Remove non-numeric characters
        cleaned = re.sub(r'[^0-9]', '', phone)
        if len(cleaned) not in (10, 11):
            self.warnings.append(f"Unusual phone length: {phone}")
        return self.format_phone(cleaned)

    def format_phone(self, digits):
        # Normalize 10-digit numbers to (XXX) XXX-XXXX; pass others through
        if len(digits) == 10:
            return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        return digits
```
2. **Standardization Rules** (a minimal sketch follows this list)
   - Consistent date formats (ISO 8601)
   - Normalized phone numbers
   - Standardized addresses
   - Unified picklist values
   - Consistent naming conventions
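A minimal sketch of how these rules might be applied during transformation; the source field names, date format, and picklist mapping are assumptions for illustration:
```python
from datetime import datetime

def standardize_record(record):
    """Apply common standardization rules to a source record (illustrative)."""
    out = dict(record)
    # Consistent dates: assume the source uses MM/DD/YYYY, emit ISO 8601
    if out.get('created_on'):
        out['created_on'] = datetime.strptime(out['created_on'], '%m/%d/%Y').date().isoformat()
    # Normalized phone numbers: keep digits only
    if out.get('phone'):
        out['phone'] = ''.join(ch for ch in out['phone'] if ch.isdigit())
    # Consistent naming: proper case for names
    if out.get('name'):
        out['name'] = out['name'].strip().title()
    # Unified picklist values: map source codes to Salesforce picklist entries
    status_map = {'A': 'Active', 'I': 'Inactive'}
    if out.get('status') in status_map:
        out['status'] = status_map[out['status']]
    return out
```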
### Transformation Patterns
1. **Lookup Transformations**
```python
# Build lookup cache for performance
def build_lookup_cache(object_name, key_field, value_field):
    cache = {}
    query = f"SELECT {key_field}, {value_field} FROM {object_name}"
    for record in sf.query_all(query)['records']:
        cache[record[key_field]] = record[value_field]
    return cache

# Use cache for transformations
account_cache = build_lookup_cache('Account', 'External_ID__c', 'Id')

def transform_contact(source_contact):
    return {
        'FirstName': source_contact['first_name'],
        'LastName': source_contact['last_name'],
        'AccountId': account_cache.get(source_contact['company_id'])
    }
```
2. **Complex Business Logic**
```python
def calculate_customer_tier(customer_data):
    revenue = customer_data.get('annual_revenue', 0)
    employee_count = customer_data.get('employees', 0)
    years_customer = customer_data.get('years_active', 0)

    score = 0
    score += min(revenue / 100000, 50)      # Up to 50 points
    score += min(employee_count / 100, 30)  # Up to 30 points
    score += min(years_customer * 4, 20)    # Up to 20 points

    if score >= 80:
        return 'Platinum'
    elif score >= 60:
        return 'Gold'
    elif score >= 40:
        return 'Silver'
    else:
        return 'Bronze'
```
### Error Handling in Transformations
```python
from datetime import datetime

def safe_transform(record, transformation_func):
    try:
        return transformation_func(record), None
    except Exception as e:
        error = {
            'record_id': record.get('id'),
            'error': str(e),
            'timestamp': datetime.now().isoformat()
        }
        return None, error

# Batch processing with error collection
def transform_batch(records, transformation_func):
    transformed = []
    errors = []
    for record in records:
        result, error = safe_transform(record, transformation_func)
        if error:
            errors.append(error)
        elif result:
            transformed.append(result)
    return transformed, errors
```
## Loading Best Practices
### Bulk API Usage
1. **Optimal Batch Sizing**
```python
def create_batches(records, batch_size=10000):
    """Create optimally sized batches for the Bulk API"""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def load_with_bulk_api(records, object_name):
    job = bulk_api.create_job(object_name, 'insert')
    for batch in create_batches(records):
        bulk_api.post_batch(job, batch)
    bulk_api.close_job(job)
    return monitor_job(job)
```
2. **Parallel Processing**
```python
from concurrent.futures import ThreadPoolExecutor
import threading

class BulkLoader:
    def __init__(self, max_workers=5):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.api_calls = 0
        self.api_limit = 5000  # Daily limit
        self.lock = threading.Lock()

    def check_api_limit(self):
        with self.lock:
            if self.api_calls >= self.api_limit:
                raise Exception("API limit reached")
            self.api_calls += 1

    def load_parallel(self, data_chunks, object_name):
        # load_chunk (not shown) performs the actual API call for one chunk
        futures = []
        for chunk in data_chunks:
            self.check_api_limit()
            future = self.executor.submit(
                self.load_chunk, chunk, object_name
            )
            futures.append(future)
        return [f.result() for f in futures]
```
### Upsert Strategies
1. **External ID Management**
```python
def prepare_upsert_data(records, external_id_field):
    # Ensure external IDs are unique and non-null
    seen_ids = set()
    clean_records = []
    for record in records:
        ext_id = record.get(external_id_field)
        if not ext_id:
            log_error(f"Missing external ID for record: {record}")
            continue
        if ext_id in seen_ids:
            log_error(f"Duplicate external ID: {ext_id}")
            continue
        seen_ids.add(ext_id)
        clean_records.append(record)
    return clean_records
```
2. **Relationship Loading**
```json
// Using external IDs for relationships
{
  "Name": "John Doe",
  "Email": "john@example.com",
  "Account__r": {
    "External_ID__c": "EXT-12345"
  },
  "Manager__r": {
    "Employee_ID__c": "EMP-67890"
  }
}
```
### Transaction Management
```python
class TransactionalLoader:
    def __init__(self, sf_connection):
        self.sf = sf_connection
        self.transaction_log = []

    def load_with_rollback(self, operations, external_id_field=None):
        """
        Load data with the ability to roll back on failure
        operations: [(object_name, records, operation_type)]
        """
        completed = []
        try:
            for obj_name, records, op_type in operations:
                if op_type == 'insert':
                    results = self.sf.bulk.__getattr__(obj_name).insert(records)
                elif op_type == 'update':
                    results = self.sf.bulk.__getattr__(obj_name).update(records)
                elif op_type == 'upsert':
                    results = self.sf.bulk.__getattr__(obj_name).upsert(
                        records, external_id_field
                    )
                # Check for errors
                errors = [r for r in results if not r['success']]
                if errors:
                    raise Exception(f"Load failed: {errors}")
                completed.append((obj_name, results))
                self.log_transaction(obj_name, op_type, len(records))
            return completed
        except Exception as e:
            # Roll back completed operations
            self.rollback(completed)
            raise e

    def rollback(self, completed_operations):
        """Delete inserted records in reverse order"""
        for obj_name, results in reversed(completed_operations):
            ids_to_delete = [r['id'] for r in results if r['success']]
            if ids_to_delete:
                self.sf.bulk.__getattr__(obj_name).delete(ids_to_delete)
```
## Performance Optimization
### API Limit Management
```python
import logging

logger = logging.getLogger(__name__)

class APILimitManager:
    def __init__(self, sf_connection):
        self.sf = sf_connection
        self.use_bulk = False
        self.check_limits()

    def check_limits(self):
        limits = self.sf.limits()
        self.daily_api_requests = limits['DailyApiRequests']
        used = self.daily_api_requests['Max'] - self.daily_api_requests['Remaining']
        usage_percent = (used / self.daily_api_requests['Max']) * 100
        # Check the critical threshold first; otherwise it can never be reached
        if usage_percent > 95:
            raise Exception("API limit critical")
        elif usage_percent > 80:
            self.switch_to_bulk_api()

    def switch_to_bulk_api(self):
        logger.info("Switching to Bulk API due to limit constraints")
        self.use_bulk = True
```
### Caching Strategies
```python
import time

class DataCache:
    def __init__(self, sf_connection, ttl_seconds=3600):
        self.sf = sf_connection
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_fetch(self, key, fetch_function):
        if key in self.cache:
            value, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return value
        value = fetch_function()
        self.cache[key] = (value, time.time())
        return value

    def lookup_record_type(self, object_name, developer_name):
        """Cached record type lookup"""
        def fetch():
            query = f"""
                SELECT Id FROM RecordType
                WHERE SObjectType = '{object_name}'
                AND DeveloperName = '{developer_name}'
            """
            result = self.sf.query(query)
            return result['records'][0]['Id'] if result['records'] else None
        return self.get_or_fetch(f"rt:{object_name}:{developer_name}", fetch)
```
### Monitoring and Metrics
```python
class ETLMetricsCollector:
    def __init__(self):
        self.metrics = {
            'start_time': None,
            'end_time': None,
            'records_processed': 0,
            'errors': 0,
            'api_calls': 0,
            'transformations': {}
        }

    def record_transformation(self, transform_name, duration, record_count):
        if transform_name not in self.metrics['transformations']:
            self.metrics['transformations'][transform_name] = {
                'total_duration': 0,
                'total_records': 0,
                'invocations': 0
            }
        stats = self.metrics['transformations'][transform_name]
        stats['total_duration'] += duration
        stats['total_records'] += record_count
        stats['invocations'] += 1

    def generate_report(self):
        duration = self.metrics['end_time'] - self.metrics['start_time']
        report = {
            'summary': {
                'duration_seconds': duration.total_seconds(),
                'records_per_second': self.metrics['records_processed'] / duration.total_seconds(),
                'error_rate': self.metrics['errors'] / self.metrics['records_processed'],
                'api_efficiency': self.metrics['records_processed'] / self.metrics['api_calls']
            },
            'transformations': {}
        }
        for name, stats in self.metrics['transformations'].items():
            report['transformations'][name] = {
                'avg_duration': stats['total_duration'] / stats['invocations'],
                'avg_records': stats['total_records'] / stats['invocations']
            }
        return report
```
## Error Handling and Recovery
### Comprehensive Error Strategy
```python
class ETLErrorHandler:
    def __init__(self, config):
        self.max_retries = config.get('max_retries', 3)
        self.retry_delay = config.get('retry_delay', 60)
        self.error_threshold = config.get('error_threshold', 0.05)
        self.retry_queue = []
        self.dead_letter_queue = []

    def handle_batch_errors(self, batch_results, batch_data):
        errors = []
        successful = []
        for i, result in enumerate(batch_results):
            if result['success']:
                successful.append(result)
            else:
                error_record = {
                    'data': batch_data[i],
                    'error': result['errors'],
                    'attempt': 1
                }
                errors.append(error_record)
        # Check error threshold
        error_rate = len(errors) / len(batch_results)
        if error_rate > self.error_threshold:
            raise Exception(f"Error rate {error_rate} exceeds threshold")
        # Process errors
        self.process_errors(errors)
        return successful

    def process_errors(self, errors):
        for error in errors:
            if self.is_retryable(error['error']):
                self.retry_queue.append(error)
            else:
                self.dead_letter_queue.append(error)
                self.log_permanent_failure(error)

    def is_retryable(self, error):
        retryable_errors = [
            'UNABLE_TO_LOCK_ROW',
            'REQUEST_LIMIT_EXCEEDED',
            'SERVER_UNAVAILABLE'
        ]
        return any(e in str(error) for e in retryable_errors)
```
### Dead Letter Queue Processing
```python
def process_dead_letter_queue(dlq_records):
    """
    Process records that failed permanently
    """
    # Group by error type
    error_groups = {}
    for record in dlq_records:
        error_type = record['error'][0]['statusCode']
        if error_type not in error_groups:
            error_groups[error_type] = []
        error_groups[error_type].append(record)

    # Generate report
    report = {
        'timestamp': datetime.now().isoformat(),
        'total_failures': len(dlq_records),
        'error_summary': {}
    }
    for error_type, records in error_groups.items():
        report['error_summary'][error_type] = {
            'count': len(records),
            'sample_errors': records[:5]  # First 5 examples
        }

    # Save to error log
    save_error_report(report)
    # Notify administrators
    send_error_notification(report)
```
## Security Best Practices
### Credential Management
```python
import keyring
import requests
from cryptography.fernet import Fernet

class SecureCredentialManager:
    def __init__(self):
        self.service_name = "salesforce_etl"

    def store_credential(self, username, password):
        """Securely store credentials"""
        keyring.set_password(self.service_name, username, password)

    def get_credential(self, username):
        """Retrieve credentials securely"""
        return keyring.get_password(self.service_name, username)

    def get_oauth_token(self, config):
        """OAuth 2.0 username-password flow for Salesforce"""
        auth_url = f"{config['instance']}/services/oauth2/token"
        data = {
            'grant_type': 'password',
            'client_id': config['client_id'],
            'client_secret': self.get_credential('client_secret'),
            'username': config['username'],
            'password': self.get_credential('password')
        }
        response = requests.post(auth_url, data=data)
        return response.json()['access_token']
```
### Data Encryption
```python
class DataEncryption:
    def __init__(self, key=None):
        self.key = key or Fernet.generate_key()
        self.cipher = Fernet(self.key)

    def encrypt_sensitive_fields(self, record, sensitive_fields):
        """Encrypt specific fields before transmission"""
        encrypted_record = record.copy()
        for field in sensitive_fields:
            if field in encrypted_record and encrypted_record[field]:
                value = str(encrypted_record[field]).encode()
                encrypted_record[field] = self.cipher.encrypt(value).decode()
        return encrypted_record

    def decrypt_sensitive_fields(self, record, sensitive_fields):
        """Decrypt fields after retrieval"""
        decrypted_record = record.copy()
        for field in sensitive_fields:
            if field in decrypted_record and decrypted_record[field]:
                value = decrypted_record[field].encode()
                decrypted_record[field] = self.cipher.decrypt(value).decode()
        return decrypted_record
```
## Testing and Validation
### ETL Testing Framework
```python
import unittest
from unittest.mock import Mock, patch

class ETLTestFramework(unittest.TestCase):
    def setUp(self):
        self.test_data = [
            {'id': '1', 'name': 'Test Account', 'revenue': 100000},
            {'id': '2', 'name': 'Another Account', 'revenue': 200000}
        ]

    def test_transformation_logic(self):
        """Test data transformation functions"""
        transformer = DataTransformer()
        for record in self.test_data:
            transformed = transformer.transform_account(record)
            # Assert transformations
            self.assertEqual(transformed['Name'], record['name'])
            self.assertEqual(transformed['AnnualRevenue'], record['revenue'])
            self.assertIn('External_ID__c', transformed)

    def test_error_handling(self):
        """Test error handling logic"""
        error_handler = ETLErrorHandler({'max_retries': 3})
        # Test retryable error
        error = {'statusCode': 'UNABLE_TO_LOCK_ROW'}
        self.assertTrue(error_handler.is_retryable(error))
        # Test non-retryable error
        error = {'statusCode': 'INVALID_FIELD'}
        self.assertFalse(error_handler.is_retryable(error))

    @patch('etl_pipeline.bulk_api')  # patch target and load_accounts are illustrative placeholders
    def test_bulk_load(self, mock_bulk):
        """Test bulk loading process"""
        mock_bulk.insert.return_value = [
            {'success': True, 'id': '001XX000003DHP0'},
            {'success': True, 'id': '001XX000003DHP1'}
        ]
        loader = BulkLoader()
        results = loader.load_accounts(self.test_data)
        self.assertEqual(len(results), 2)
        self.assertTrue(all(r['success'] for r in results))
```
### Data Validation Tests
```python
def validate_etl_results(source_data, target_data, mapping_rules):
    """
    Validate ETL transformation results
    """
    validation_results = {
        'passed': 0,
        'failed': 0,
        'errors': []
    }
    for i, source_record in enumerate(source_data):
        target_record = target_data[i]
        for rule in mapping_rules:
            source_value = source_record.get(rule['source'])
            target_value = target_record.get(rule['target'])
            # Apply transformation for comparison
            expected_value = apply_transformation(
                source_value,
                rule.get('transformation')
            )
            if target_value != expected_value:
                validation_results['failed'] += 1
                validation_results['errors'].append({
                    'record_id': source_record.get('id'),
                    'field': rule['target'],
                    'expected': expected_value,
                    'actual': target_value
                })
            else:
                validation_results['passed'] += 1
    return validation_results
```
## Documentation Standards
### ETL Process Documentation
```yaml
# ETL Process Documentation Template
etl_process:
  name: Customer Data Sync
  version: 1.0
  last_updated: 2024-01-15
  owner: Data Integration Team

  overview: |
    Synchronizes customer data from ERP to Salesforce.
    Runs daily at 2 AM EST.

  source_systems:
    - name: SAP ERP
      connection: sap_prod
      objects:
        - KNA1 (Customer Master)
        - KNVV (Sales Area Data)

  target_system:
    name: Salesforce Production
    connection: sf_prod
    objects:
      - Account
      - Contact

  transformations:
    - name: Customer to Account
      source: KNA1
      target: Account
      rules:
        - Proper case for names
        - Phone number formatting
        - Address standardization

  error_handling:
    retry_attempts: 3
    retry_delay: 300
    error_notification: data-team@company.com

  monitoring:
    success_metric: 99% records processed
    performance_metric: < 2 hours completion
    alerts:
      - Error rate > 5%
      - Duration > 3 hours
      - API limit > 80%
```
## Maintenance and Operations
### Regular Maintenance Tasks
1. **Daily Tasks** (a sample health-check sketch follows this list)
   - Monitor job execution logs
   - Check error rates
   - Verify data quality metrics
   - Review API usage
2. **Weekly Tasks**
   - Analyze performance trends
   - Review and process the DLQ
   - Update transformation rules
   - Optimize slow queries
3. **Monthly Tasks**
   - Full data validation
   - Performance optimization
   - Security audit
   - Documentation updates
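Parts of the daily checklist can be automated; the sketch below checks run metrics against the alert thresholds from the documentation template above (error rate > 5%, duration > 3 hours, API limit > 80%). The shapes of the `metrics` and `limits` dictionaries are assumptions.
```python
def daily_health_check(metrics, limits):
    """Flag ETL runs that breach the documented alert thresholds (illustrative)."""
    alerts = []
    error_rate = metrics['errors'] / max(metrics['records_processed'], 1)
    if error_rate > 0.05:
        alerts.append(f"Error rate {error_rate:.1%} exceeds the 5% threshold")
    if metrics['duration_hours'] > 3:
        alerts.append(f"Run took {metrics['duration_hours']:.1f}h (limit: 3h)")
    api_usage = (limits['Max'] - limits['Remaining']) / limits['Max']
    if api_usage > 0.80:
        alerts.append(f"Daily API usage at {api_usage:.0%} (threshold: 80%)")
    return alerts  # route to the team's notification channel
```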
### Operational Runbook
```markdown
# ETL Operations Runbook
## Job Failure Response
1. Check job logs for error details
2. Verify source system availability
3. Check Salesforce system status
4. Review API limits
5. Restart job if transient issue
6. Escalate if persistent failure
## Performance Degradation
1. Check data volume changes
2. Review query performance
3. Analyze transformation bottlenecks
4. Verify network connectivity
5. Scale resources if needed
## Data Quality Issues
1. Run validation reports
2. Identify affected records
3. Determine root cause
4. Fix at source if possible
5. Run correction procedures
6. Update validation rules
```
## Additional Resources
- [Salesforce Bulk API Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/)
- [Integration Patterns and Practices](https://developer.salesforce.com/docs/atlas.en-us.integration_patterns_and_practices.meta/)
- [Large Data Volume Best Practices](https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/)
- [Big Objects Implementation Guide](https://developer.salesforce.com/docs/atlas.en-us.bigobjects.meta/)