# ETL Patterns for Salesforce
## Overview
ETL (Extract, Transform, Load) patterns provide proven approaches for moving and
transforming data into and out of Salesforce while maintaining data quality and
system performance.
## Core ETL Concepts
### ETL vs ELT
**ETL (Extract, Transform, Load)**:
- Transform data before loading
- Processing happens in middleware
- Better for complex transformations
- Reduces load on target system
**ELT (Extract, Load, Transform)**:
- Load raw data first
- Transform within Salesforce
- Leverages platform capabilities
- Simpler architecture
### Key Considerations
- **Volume**: Data quantity impacts approach
- **Velocity**: Speed requirements
- **Variety**: Data types and sources
- **Veracity**: Quality requirements
- **Value**: Business importance
## Common ETL Patterns
### Pattern 1: Batch Integration
**Use Case**: Large volume, non-real-time data sync
**Implementation**:
```
1. Extract: Query source system
2. Transform: Apply business rules
3. Stage: Temporary storage
4. Validate: Quality checks
5. Load: Bulk API operation
6. Verify: Post-load validation
```
**Best Practices**:
- Use Bulk API for >10k records
- Implement parallel processing
- Handle errors gracefully
- Monitor API limits
- Schedule during off-hours
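The six numbered steps above can be sketched end to end. This is a minimal Python skeleton, assuming a hypothetical `transform` rule, `validate` check, and a caller-supplied `load_fn` (e.g. a Bulk API wrapper) — none of these names come from a specific connector:

```python
# Minimal batch-ETL skeleton following the six steps above.
# `transform`, `validate`, and `load_fn` are illustrative placeholders.

def transform(record):
    # Step 2: apply business rules (example: normalize the account name)
    return {**record, "Name": record["Name"].strip().title()}

def validate(record):
    # Step 4: quality checks before loading
    return bool(record.get("Name")) and "@" in record.get("Email", "")

def run_batch_job(source_records, load_fn):
    staged = [transform(r) for r in source_records]   # Steps 2-3: transform + stage
    good = [r for r in staged if validate(r)]         # Step 4: validate
    rejected = [r for r in staged if not validate(r)]
    results = load_fn(good)                           # Step 5: load (e.g. Bulk API)
    return {"loaded": len(good), "rejected": len(rejected), "results": results}
```

A real job would route `rejected` records to an error log rather than dropping them.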
### Pattern 2: Real-Time Sync
**Use Case**: Immediate data synchronization
**Implementation**:
```
1. Trigger: Source system event
2. Extract: Get changed data
3. Transform: Apply mappings
4. Load: REST/SOAP API call
5. Confirm: Acknowledge receipt
```
**Best Practices**:
- Use Platform Events for high volume
- Implement circuit breakers
- Queue for resilience
- Monitor latency
- Handle failures gracefully
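The circuit-breaker practice above can be sketched as a small wrapper around the sync call; the failure threshold and reset window are illustrative assumptions, not prescribed values:

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call
    (half-open) once `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping sync call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrap each outbound sync call in `breaker.call(...)` so a failing target system stops receiving traffic while events queue up behind it.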
### Pattern 3: Change Data Capture (CDC)
**Use Case**: Sync only changed records
**Implementation**:
```
1. Enable: CDC on objects
2. Subscribe: Change events
3. Process: Handle changes
4. Transform: Apply logic
5. Update: Target system
```
**Best Practices**:
- Filter unnecessary changes
- Handle redelivery
- Maintain event order
- Monitor event volume
- Plan retention period
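Each CDC event carries a `ChangeEventHeader` with the change type and affected record IDs, which a subscriber can dispatch on. The header field names below follow the published event shape; the per-type handler callbacks are hypothetical:

```python
def handle_change_event(payload, handlers):
    """Dispatch a CDC event payload to per-changeType handlers.
    `handlers` is a dict like {"CREATE": fn, "UPDATE": fn} (hypothetical callbacks)."""
    header = payload["ChangeEventHeader"]
    change_type = header["changeType"]   # CREATE / UPDATE / DELETE / UNDELETE
    record_ids = header["recordIds"]
    handler = handlers.get(change_type)
    if handler is None:
        return []  # filter unneeded change types instead of processing everything
    # Pass only the changed field values, not a full record
    changed = {k: v for k, v in payload.items() if k != "ChangeEventHeader"}
    return [handler(rid, changed) for rid in record_ids]
```

Storing the replay ID from each processed event lets the subscriber resume after a disconnect and handle redelivery idempotently.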
### Pattern 4: Master Data Management
**Use Case**: Salesforce as single source of truth
**Implementation**:
```
1. Identify: Master records
2. Match: Find duplicates
3. Merge: Consolidate data
4. Enrich: Add missing data
5. Distribute: Sync to systems
```
**Best Practices**:
- Define match rules clearly
- Implement survivorship rules
- Maintain audit trail
- Handle conflicts
- Monitor data quality
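A common survivorship rule keeps, for each field, the non-empty value from the most recently modified duplicate. A minimal sketch, assuming records are plain dicts with a `LastModifiedDate` key:

```python
def merge_duplicates(duplicates, timestamp_key="LastModifiedDate"):
    """Survivorship sketch: per field, the newest non-empty value wins."""
    ordered = sorted(duplicates, key=lambda r: r[timestamp_key], reverse=True)
    golden = {}
    for record in ordered:
        for field, value in record.items():
            # First (i.e. newest) non-empty value seen for each field survives
            if field not in golden and value not in (None, ""):
                golden[field] = value
    return golden
```

Real MDM implementations layer source-system priority and field-level trust scores on top of this recency rule, and record which source each surviving value came from for the audit trail.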
## Data Extraction Patterns
### Query-Based Extraction
**SOQL for Targeted Data**:
```sql
SELECT Id, Name, LastModifiedDate
FROM Account
WHERE LastModifiedDate > :lastRunDate
AND Type = 'Customer'
```
**Considerations**:
- Query optimization
- Selective filters
- Relationship queries
- Governor limits
- Pagination handling
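Incremental, paginated extraction can be written as a generator so downstream steps stream records instead of buffering them. `client.query_page` here is a hypothetical wrapper returning `(records, next_cursor)`, mirroring how the REST query endpoint hands back `nextRecordsUrl`:

```python
def extract_incremental(client, soql, page_size=2000):
    """Yield records page by page until the cursor is exhausted.
    `client.query_page` is an assumed wrapper, not a real library call."""
    cursor = None
    while True:
        records, cursor = client.query_page(soql, cursor=cursor, limit=page_size)
        yield from records
        if cursor is None:
            break
```

Combined with a `WHERE LastModifiedDate > :lastRunDate` filter like the one above, this keeps each run's footprint small regardless of total table size.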
### Bulk Data Extraction
**Bulk API Usage**:
```python
# Python example (salesforce-bulk style client; `bulk` is an authenticated session)
import time

job = bulk.create_query_job("Account")
batch = bulk.query(job, "SELECT Id, Name FROM Account")
while not bulk.is_batch_done(batch):
    time.sleep(10)
results = bulk.get_all_results_for_query_batch(batch)
```
**Best Practices**:
- Chunk large datasets
- Handle timeouts
- Monitor job status
- Stream results rather than buffering them in memory
- Clean up completed jobs
### Report/Analytics API
**Extract Aggregated Data**:
```
1. Define report criteria
2. Execute report via API
3. Parse results
4. Transform as needed
5. Load to target
```
**Use Cases**:
- Summary data extraction
- Complex calculations
- Cross-object aggregations
- Trending analysis
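Report results from the Analytics REST API arrive in a `factMap`; for a simple tabular report the `"T!T"` key holds the detail rows, each a list of `dataCells` with `label`/`value` pairs. A parsing sketch assuming that shape:

```python
def parse_tabular_report(report_json):
    """Flatten a tabular report's factMap into row dicts keyed by column name."""
    columns = report_json["reportMetadata"]["detailColumns"]
    rows = report_json["factMap"]["T!T"]["rows"]
    return [
        {col: cell["label"] for col, cell in zip(columns, row["dataCells"])}
        for row in rows
    ]
```

Grouped (matrix/summary) reports use composite factMap keys per grouping, so a production parser needs to walk those keys rather than assume `"T!T"`.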
## Transformation Patterns
### Field Mapping
**Simple Mapping**:
```json
{
"source_field": "target_field",
"FirstName": "First_Name__c",
"LastName": "Last_Name__c",
"Email": "Email__c"
}
```
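A mapping table like the JSON above can be applied generically. This sketch drops source fields that have no mapping and skips mappings absent from the source record; whether unmapped fields should instead be passed through is a per-integration decision:

```python
def apply_mapping(source_record, field_map):
    """Rename source fields to target API names per the mapping table."""
    return {
        target: source_record[source]
        for source, target in field_map.items()
        if source in source_record  # skip mappings with no source value
    }
```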
**Complex Mapping**:
```javascript
// Concatenation
target.Full_Name__c = source.FirstName + ' ' + source.LastName;
// Lookup transformation
target.Account__c = lookupAccountId(source.CompanyName);
// Conditional logic
target.Status__c = source.IsActive ? 'Active' : 'Inactive';
```
### Data Type Conversion
**Common Conversions**:
```javascript
// String to Date (Date.parse returns a timestamp, not a Date object)
target.Birth_Date__c = new Date(source.DOB);
// Number formatting
target.Revenue__c = parseFloat(source.Revenue.replace(/[^0-9.]/g, ''));
// Boolean conversion
target.Is_Active__c = source.Status === 'Active';
// Picklist mapping
target.Type__c = mapPicklistValue(source.CustomerType);
```
### Data Quality Transformations
**Cleansing Operations**:
```javascript
// Standardize phone
target.Phone = formatPhone(source.Phone);
// Clean email
target.Email = source.Email.toLowerCase().trim();
// Standardize company name
target.Account_Name__c = standardizeCompanyName(source.Company);
// Address formatting
target.Billing_Address__c = formatAddress(source.Address);
```
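The helper functions called above can be sketched in Python. The phone format, the assumption of 10-digit US numbers, and the legal-suffix list are all illustrative choices, not a standard:

```python
import re

def format_phone(raw, default_country="+1"):
    """Normalize US-style numbers to one format (assumes 10 significant digits)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return raw  # leave unrecognized values untouched for manual review
    return f"{default_country} ({digits[:3]}) {digits[3:6]}-{digits[6:]}"

def clean_email(raw):
    return raw.strip().lower()

def standardize_company_name(raw, suffixes=("inc", "inc.", "llc", "ltd")):
    """Drop a trailing legal suffix and title-case the rest (illustrative rule)."""
    words = raw.strip().split()
    if words and words[-1].lower() in suffixes:
        words = words[:-1]
    return " ".join(w.capitalize() for w in words)
```

Returning unrecognized values unchanged (rather than guessing) keeps bad input visible to data-quality monitoring instead of silently corrupting it.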
### Business Rule Application
**Complex Logic**:
```javascript
// Lead scoring
target.Lead_Score__c = calculateLeadScore({
title: source.Title,
company_size: source.Employees,
industry: source.Industry,
});
// Territory assignment
target.Territory__c = assignTerritory({
state: source.State,
revenue: source.Annual_Revenue,
});
// Categorization
target.Customer_Segment__c = determineSegment(source);
```
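A `calculateLeadScore` like the one called above is often an additive rule table. A Python sketch in that style; the titles, size thresholds, target industries, and weights are all illustrative assumptions:

```python
def calculate_lead_score(lead):
    """Additive rule-based scoring; every weight here is an illustrative choice."""
    score = 0
    title = (lead.get("title") or "").lower()
    if any(k in title for k in ("vp", "director", "chief", "head")):
        score += 30                      # seniority signal
    size = lead.get("company_size") or 0
    if size >= 1000:
        score += 25                      # enterprise
    elif size >= 100:
        score += 15                      # mid-market
    if lead.get("industry") in ("Technology", "Finance"):
        score += 20                      # target industries
    return min(score, 100)               # cap at 100
```

Keeping the rules in one pure function makes the scoring easy to unit-test and to document alongside the field mappings.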
## Loading Patterns
### Bulk API Loading
**Optimal for Large Volumes**:
```python
def bulk_load_records(records, object_name):
    job = bulk.create_insert_job(object_name)
    batches = []
    # Create batches of 10,000 records (the Bulk API per-batch limit)
    for i in range(0, len(records), 10000):
        batch = records[i:i + 10000]
        batches.append(bulk.post_batch(job, batch))
    # Monitor completion, one batch at a time
    for batch in batches:
        bulk.wait_for_batch(job, batch)
    # Check results
    for batch in batches:
        results = bulk.get_batch_results(batch)
        process_results(results)
```
### Upsert Operations
**Insert or Update Based on External ID**:
```apex
List<Account> accounts = new List<Account>();
for (ExternalData__c ext : externalData) {
    accounts.add(new Account(
        External_ID__c = ext.Id,
        Name = ext.CompanyName
        // ... other fields
    ));
}
Database.upsert(accounts, Account.External_ID__c, false);
```
### Relationship Loading
**Parent-Child Relationships**:
Using external IDs:
```json
{
  "Name": "John Doe",
  "Account__r": {
    "External_ID__c": "EXT-12345"
  }
}
```
Using Salesforce IDs:
```json
{
  "Name": "Opportunity ABC",
  "AccountId": "001XX000003DHPh"
}
```
## Error Handling Patterns
### Retry Logic
```javascript
const maxRetries = 3;
const retryDelay = 5000; // 5 seconds

async function loadWithRetry(data, attempt = 1) {
  try {
    return await salesforceAPI.insert(data);
  } catch (error) {
    if (attempt < maxRetries && isRetryable(error)) {
      await sleep(retryDelay * attempt); // linear backoff
      return loadWithRetry(data, attempt + 1);
    }
    throw error;
  }
}
```
### Error Logging
```apex
public class ETLErrorLogger {
    public static void logError(String process, String record, Exception e) {
        ETL_Error_Log__c errorLog = new ETL_Error_Log__c(
            Process__c = process,
            Record_Identifier__c = record,
            Error_Message__c = e.getMessage(),
            Stack_Trace__c = e.getStackTraceString(),
            Timestamp__c = DateTime.now()
        );
        insert errorLog;
    }
}
```
### Dead Letter Queue
```javascript
// Failed records go to a dead letter queue for later inspection/replay
function processWithDLQ(records) {
  const failed = [];
  for (const record of records) {
    try {
      processRecord(record);
    } catch (error) {
      failed.push({
        record: record,
        error: error.message,
        timestamp: new Date(),
      });
    }
  }
  if (failed.length > 0) {
    moveToDeadLetterQueue(failed);
  }
}
```
## Performance Optimization
### Parallel Processing
```python
import concurrent.futures
import math

def parallel_load(records, num_threads=5):
    chunk_size = math.ceil(len(records) / num_threads)
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(load_chunk, chunk) for chunk in chunks]
        results = [f.result() for f in concurrent.futures.as_completed(futures)]
    return results
```
### Bulk API Best Practices
```javascript
// Optimal batch sizing
const OPTIMAL_BATCH_SIZE = 10000;
const MAX_BATCHES_PER_JOB = 100;
// Compression for large payloads
const compressedData = gzip(JSON.stringify(records));
// Binary attachments handling
const attachments = records.filter((r) => r.hasAttachment);
const regularRecords = records.filter((r) => !r.hasAttachment);
// Process separately for performance
processBulkRecords(regularRecords);
processAttachments(attachments);
```
## Monitoring and Logging
### ETL Job Monitoring
```apex
public class ETLJobMonitor {
    public static void startJob(String jobName) {
        ETL_Job__c job = new ETL_Job__c(
            Name = jobName,
            Status__c = 'Running',
            Start_Time__c = DateTime.now()
        );
        insert job;
    }

    public static void updateProgress(Id jobId, Integer processed, Integer total) {
        ETL_Job__c job = new ETL_Job__c(
            Id = jobId,
            Records_Processed__c = processed,
            Total_Records__c = total,
            Progress__c = (Decimal) processed / total * 100
        );
        update job;
    }
}
```
### Performance Metrics
```javascript
class ETLMetrics {
  constructor() {
    this.startTime = Date.now();
    this.recordsProcessed = 0;
    this.errors = 0;
  }

  recordProcessed(success = true) {
    this.recordsProcessed++;
    if (!success) this.errors++;
  }

  getMetrics() {
    const duration = Date.now() - this.startTime;
    const processed = Math.max(this.recordsProcessed, 1); // avoid divide-by-zero
    return {
      duration: duration,
      recordsPerSecond: this.recordsProcessed / (duration / 1000),
      errorRate: this.errors / processed,
      successRate: 1 - this.errors / processed,
    };
  }
}
```
## Security Considerations
### Credential Management
- Use Named Credentials
- Implement OAuth where possible
- Rotate API keys regularly
- Encrypt sensitive data in transit
- Use secure storage for credentials
### Data Security
```apex
// Field-level encryption (pass in a key retrieved from secure storage;
// generating a fresh key on every call would make the data unrecoverable)
public String encryptSensitiveData(String data, Blob key) {
    Blob encrypted = Crypto.encryptWithManagedIV('AES256', key, Blob.valueOf(data));
    return EncodingUtil.base64Encode(encrypted);
}

// Masking sensitive data in logs
public String maskSensitiveData(String data) {
    if (data == null || data.length() <= 8) {
        return '****';
    }
    return data.substring(0, 4) + '****' + data.substring(data.length() - 4);
}
```
## Best Practices Summary
1. **Plan Thoroughly**: Understand data volumes and patterns
2. **Use Appropriate APIs**: Bulk for volume, REST for real-time
3. **Handle Errors Gracefully**: Implement retry and logging
4. **Monitor Performance**: Track metrics and optimize
5. **Ensure Data Quality**: Validate before and after
6. **Secure Data**: Encrypt and protect sensitive information
7. **Document Mappings**: Maintain transformation documentation
8. **Test Extensively**: Include edge cases and error scenarios
9. **Plan for Scale**: Design for future growth
10. **Maintain Audit Trail**: Track all data movements