# Salesforce Bulk API Guide
## Overview
The Salesforce Bulk API is designed to process data sets ranging from a few
thousand to millions of records. It is optimized for loading, updating, or
deleting large volumes of data asynchronously, making it the right choice for
large-scale data operations in Salesforce.
## Key Concepts
### Core Principles
- **Asynchronous Processing**: Operations run in the background, allowing for
large-scale data processing
- **Batch Processing**: Data is processed in batches for optimal performance and
resource usage
- **Parallel Processing**: Multiple batches can be processed simultaneously for
faster completion
### API Versions
1. **Bulk API 1.0**
   - CSV, XML, or JSON request data
   - Custom REST endpoints (`/services/async/`) with SOAP-style session authentication
   - Maximum 10,000 records (or 10 MB) per batch
   - Batches must be created and managed explicitly
2. **Bulk API 2.0**
   - CSV format only
   - Built on the standard REST API framework
   - Simpler implementation
   - No batch management required (batching is handled internally)
   - Maximum data size: 150 MB (base64-encoded) per job
## Implementation Guide
### Bulk API 2.0 Implementation
#### 1. Create a Job
```python
import requests
import json
# Authentication headers (access_token and instance_url come from your OAuth
# flow; see the sketch after this block)
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
# Job creation
job_data = {
"object": "Account",
"operation": "insert",
"lineEnding": "LF"
}
response = requests.post(
f'{instance_url}/services/data/v58.0/jobs/ingest',
headers=headers,
data=json.dumps(job_data)
)
job_id = response.json()['id']
```
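The snippet above assumes `access_token` and `instance_url` already exist. One way to obtain them is the OAuth 2.0 client credentials flow; the sketch below assumes a connected app configured for that flow and uses placeholder credential names and a placeholder My Domain URL.

```python
import requests

# Placeholder connected-app credentials and My Domain URL; any OAuth 2.0 flow
# that yields an access token and instance URL works equally well here.
token_response = requests.post(
    'https://your-domain.my.salesforce.com/services/oauth2/token',
    data={
        'grant_type': 'client_credentials',
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET
    }
)
token_response.raise_for_status()
auth = token_response.json()
access_token = auth['access_token']
instance_url = auth['instance_url']
```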
#### 2. Upload Data
```python
# CSV data
csv_data = """Name,BillingCity,NumberOfEmployees
Acme Corp,San Francisco,1000
Global Tech,New York,5000
"""
headers['Content-Type'] = 'text/csv'
upload_response = requests.put(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/batches',
headers=headers,
data=csv_data
)
```
#### 3. Close Job and Monitor
```python
# Close the job
close_data = {"state": "UploadComplete"}
headers['Content-Type'] = 'application/json'
close_response = requests.patch(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
headers=headers,
data=json.dumps(close_data)
)
# Check job status
status_response = requests.get(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
headers=headers
)
```
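Once the job reaches `JobComplete`, per-record results can be downloaded from the results endpoints. A minimal sketch for the successful results (the CSV prepends `sf__Id` and `sf__Created` columns to the original fields):

```python
# Download successful results after the job reports JobComplete
success_response = requests.get(
    f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/successfulResults',
    headers={'Authorization': f'Bearer {access_token}', 'Accept': 'text/csv'}
)
with open('successful_records.csv', 'w') as f:
    f.write(success_response.text)
```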
### Bulk API 1.0 Implementation (Legacy)
```java
// Java example using the Salesforce Web Service Connector (WSC) library.
// config is a ConnectorConfig carrying the session ID and async API endpoint.
BulkConnection bulkConnection = new BulkConnection(config);

JobInfo job = new JobInfo();
job.setObject("Account");
job.setOperation(OperationEnum.insert);
job.setContentType(ContentType.CSV);
job = bulkConnection.createJob(job);

// Build batches: each batch is a CSV (header + up to 10,000 rows) streamed to the job.
// Note: fields with embedded commas or quotes would need re-escaping when rejoined.
CSVReader csvReader = new CSVReader(inputStream);
String header = String.join(",", csvReader.nextRecord()) + "\n";
List<BatchInfo> batchInfoList = new ArrayList<>();
StringBuilder batchCsv = new StringBuilder(header);
int rows = 0;
List<String> fields;
while ((fields = csvReader.nextRecord()) != null) {
    batchCsv.append(String.join(",", fields)).append("\n");
    if (++rows == 10000) {  // flush a full batch
        batchInfoList.add(bulkConnection.createBatchFromStream(job,
            new ByteArrayInputStream(batchCsv.toString().getBytes(StandardCharsets.UTF_8))));
        batchCsv = new StringBuilder(header);
        rows = 0;
    }
}
if (rows > 0) {  // flush the final partial batch
    batchInfoList.add(bulkConnection.createBatchFromStream(job,
        new ByteArrayInputStream(batchCsv.toString().getBytes(StandardCharsets.UTF_8))));
}
```
## Best Practices
### 1. **Optimal Batch Sizing**
- Bulk API 2.0: Keep each upload under roughly 100 MB of raw CSV for best performance (the 150 MB job limit applies to base64-encoded data); see the chunking sketch below
- Bulk API 1.0: 5,000-10,000 records per batch
- Consider record complexity (field count, triggers, lookups) when sizing batches
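One way to respect the size guidance above is to split rows into multiple uploads by accumulated payload size. The sketch below is a rough approach; the 100 MB threshold is a conservative default, not an API requirement.

```python
def split_csv_rows(header, rows, max_bytes=100 * 1024 * 1024):
    """Yield CSV payloads (header + rows) that each stay under max_bytes."""
    chunk, size = [header], len(header.encode('utf-8')) + 1
    for row in rows:
        row_bytes = len(row.encode('utf-8')) + 1  # +1 for the trailing newline
        if size + row_bytes > max_bytes and len(chunk) > 1:
            yield '\n'.join(chunk) + '\n'
            chunk, size = [header], len(header.encode('utf-8')) + 1
        chunk.append(row)
        size += row_bytes
    if len(chunk) > 1:
        yield '\n'.join(chunk) + '\n'
```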
### 2. **Error Handling**
```python
def process_job_results(job_id):
    # Failed records come back as CSV with sf__Id and sf__Error columns
    failed_results = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/failedResults',
        headers=headers
    )
    if failed_results.text:
        with open('failed_records.csv', 'w') as f:
            f.write(failed_results.text)

    # Records that were never processed (e.g. the job was aborted or failed early)
    unprocessed_results = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/unprocessedrecords',
        headers=headers
    )
    if unprocessed_results.text:
        with open('unprocessed_records.csv', 'w') as f:
            f.write(unprocessed_results.text)
```
### 3. **Performance Optimization**
- Use external ID fields for upsert operations
- Disable triggers, workflows, and validation rules during bulk loads when
appropriate
- Consider using parallel processing for multiple objects
- Schedule jobs during off-peak hours
### 4. **Data Preparation**
```python
# Clean and prepare a single field value before it is written into the CSV payload
def prepare_csv_field(value):
    if value is None or value == 'NULL':
        return ''  # blank = not set; use '#N/A' to explicitly null a field
    value = str(value).replace('\r\n', '\n')  # normalize line endings (the job uses LF)
    if any(ch in value for ch in (',', '"', '\n')):
        value = '"' + value.replace('"', '""') + '"'  # quote and escape per CSV rules
    return value
```
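Alternatively, Python's built-in `csv` module handles quoting and escaping automatically. A minimal sketch, assuming records are dictionaries keyed by field API name:

```python
import csv
import io

def build_csv_payload(records, field_names):
    """Serialize dict records into a Bulk API-ready CSV string (LF line endings)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator='\n')
    writer.writeheader()
    for record in records:
        # Missing keys become empty fields; use '#N/A' to explicitly null a field
        writer.writerow({name: record.get(name, '') for name in field_names})
    return buffer.getvalue()
```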
## Common Operations
### Insert Operation
```python
job_config = {
"object": "Contact",
"operation": "insert"
}
```
### Update Operation
```python
job_config = {
"object": "Account",
"operation": "update"
}
# Requires Id field in CSV
```
### Upsert Operation
```python
job_config = {
"object": "Product2",
"operation": "upsert",
"externalIdFieldName": "Product_Code__c"
}
```
### Delete Operation
```python
job_config = {
"object": "Opportunity",
"operation": "delete"
}
# Only requires Id field
```
### Hard Delete Operation
```python
job_config = {
"object": "Lead",
"operation": "hardDelete"
}
# Requires the "Bulk API Hard Delete" user permission
```
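Each of these `job_config` variants plugs into the same job-creation request shown in the implementation guide. A small helper keeps the flow uniform (the `create_ingest_job` name is illustrative):

```python
def create_ingest_job(job_config):
    """Create a Bulk API 2.0 ingest job from one of the configs above."""
    response = requests.post(
        f'{instance_url}/services/data/v58.0/jobs/ingest',
        headers={
            'Authorization': f'Bearer {access_token}',
            'Content-Type': 'application/json'
        },
        json={**job_config, 'lineEnding': 'LF'}
    )
    response.raise_for_status()
    return response.json()['id']

# Usage
job_id = create_ingest_job({"object": "Contact", "operation": "insert"})
```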
## Monitoring and Management
### Job Status States
- **Open**: Job created, ready for batches
- **UploadComplete**: Data upload finished
- **InProgress**: Job is processing
- **Aborted**: Job was aborted
- **Failed**: Job failed
- **JobComplete**: All processing complete
### Query Job Status
```python
import time

def monitor_job(job_id):
    while True:
        response = requests.get(
            f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
            headers=headers
        )
        job_info = response.json()
        state = job_info['state']
        print(f"Job State: {state}")
        print(f"Records Processed: {job_info['numberRecordsProcessed']}")
        print(f"Records Failed: {job_info['numberRecordsFailed']}")
        if state in ['JobComplete', 'Failed', 'Aborted']:
            return job_info
        time.sleep(10)  # Poll every 10 seconds
```
## Limits and Considerations
### API Limits
| Limit Type                  | Bulk API 1.0                         | Bulk API 2.0                                      |
| --------------------------- | ------------------------------------ | ------------------------------------------------- |
| Max data size               | 10 MB per batch                      | 150 MB (base64-encoded) per job                   |
| Max records per batch       | 10,000                               | N/A (batching handled internally)                 |
| Batch allocation            | 15,000 per rolling 24 hours (shared) | Internal batches count toward the same allocation |
| Max characters per field    | 32,000                               | 32,000                                            |
| Max jobs open concurrently  | 100                                  | 100                                               |
| Max time before job timeout | 7 days                               | 7 days                                            |
### Org Allocation Limits
- Daily Bulk API calls: based on org edition (see the allocation check sketch below)
- Concurrent Bulk API jobs: 100
- Total size of Bulk API batches queued: 250 MB
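Current consumption against these allocations can be checked with the REST `limits` resource. The sketch below is assumption-heavy: the exact key names reported vary by org and API version, so inspect the response for your org.

```python
def check_bulk_allocations():
    response = requests.get(
        f'{instance_url}/services/data/v58.0/limits',
        headers={'Authorization': f'Bearer {access_token}'}
    )
    limits = response.json()
    # Key names are assumptions; print everything Bulk-related that the org reports
    for key, value in limits.items():
        if 'Bulk' in key:
            print(f"{key}: {value['Remaining']} of {value['Max']} remaining")
```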
## Error Handling Patterns
### Retry Logic
```python
def bulk_operation_with_retry(data, max_retries=3):
    for attempt in range(max_retries):
        try:
            job_id = create_bulk_job()
            upload_data(job_id, data)
            close_job(job_id)
            results = wait_for_completion(job_id)
            if results['numberRecordsFailed'] == 0:
                return results
            # Retry with only the failed records (strip the sf__ result columns
            # first; see the sketch below)
            data = get_failed_records(job_id)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    raise RuntimeError('Bulk job still has failed records after all retries')
```
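The `failedResults` CSV prepends `sf__Id` and `sf__Error` columns to the original record fields, so those columns should be stripped before re-uploading. A sketch of `get_failed_records` along those lines (the other helpers above remain hypothetical):

```python
import csv
import io

def get_failed_records(job_id):
    """Download failed results and return a CSV payload ready to re-upload."""
    response = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/failedResults',
        headers={'Authorization': f'Bearer {access_token}'}
    )
    reader = csv.DictReader(io.StringIO(response.text))
    # Drop the sf__ result columns added by Salesforce; keep the original fields
    field_names = [name for name in reader.fieldnames if not name.startswith('sf__')]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in field_names})
    return buffer.getvalue()
```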
### Common Errors
1. **INVALID_FIELD**: Field doesn't exist or isn't accessible
2. **REQUIRED_FIELD_MISSING**: Required field not provided
3. **DUPLICATE_VALUE**: Unique constraint violation
4. **INVALID_CROSS_REFERENCE_KEY**: Invalid lookup relationship
5. **TOO_MANY_REQUESTS**: API limit exceeded
## Security Best Practices
1. **Authentication**
- Use OAuth 2.0 for authentication
- Rotate tokens regularly
- Store credentials securely
2. **Data Security**
- Encrypt data in transit
- Validate data before upload
- Implement field-level security
3. **Access Control**
- Use integration user with minimal permissions
- Implement IP restrictions
- Enable login forensics
## Performance Tuning
### Optimization Techniques
1. **Parallel Processing**
```python
from concurrent.futures import ThreadPoolExecutor
def process_multiple_objects(object_data_map):
with ThreadPoolExecutor(max_workers=5) as executor:
futures = []
for obj_name, data in object_data_map.items():
future = executor.submit(bulk_load_object, obj_name, data)
futures.append(future)
# Wait for all to complete
for future in futures:
result = future.result()
```
2. **Data Preparation**
- Pre-sort data by parent records
- Group related records
- Remove unnecessary columns
3. **Resource Management**
- Monitor API usage
- Implement backpressure (see the semaphore sketch below)
- Use connection pooling
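One simple form of backpressure is capping the number of Bulk jobs in flight with a semaphore. A minimal sketch, reusing the hypothetical `bulk_load_object` helper from the parallel-processing example:

```python
import threading

# Cap the number of Bulk API jobs in flight, regardless of which thread submits them
MAX_CONCURRENT_JOBS = 5
job_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

def throttled_bulk_load(obj_name, data):
    with job_slots:  # blocks callers once MAX_CONCURRENT_JOBS loads are running
        return bulk_load_object(obj_name, data)
```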
## Integration Patterns
### ETL Pipeline Integration
```python
class SalesforceBulkETL:
def __init__(self, config):
self.config = config
self.sf_client = self.connect()
def extract_transform_load(self, source_data):
# Transform data
transformed_data = self.transform(source_data)
# Create Bulk API job
job_id = self.create_job('Account', 'upsert', 'External_Id__c')
# Upload in chunks
for chunk in self.chunk_data(transformed_data, 50000):
self.upload_chunk(job_id, chunk)
# Close and monitor
self.close_job(job_id)
return self.monitor_job(job_id)
```
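The `chunk_data` helper referenced above is left undefined. A minimal sketch, shown as a standalone function for clarity, assuming `transformed_data` is a list of dict records and reusing the `build_csv_payload` builder from the data-preparation section:

```python
def chunk_data(records, chunk_size=50000):
    """Yield CSV payloads of at most chunk_size records each."""
    field_names = list(records[0].keys())
    for start in range(0, len(records), chunk_size):
        yield build_csv_payload(records[start:start + chunk_size], field_names)
```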
## Troubleshooting
### Common Issues and Solutions
1. **Job Timeout**
- Break large datasets into smaller jobs
- Process during off-peak hours
- Use query-based processing for very large datasets
2. **Memory Issues**
- Stream data instead of loading into memory
- Process in smaller chunks
- Use generators for data transformation
3. **Lock Contention**
- Avoid updating same records concurrently
- Use external IDs for relationships
- Consider record locking strategy
## Additional Resources
- [Salesforce Bulk API Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/)
- [Bulk API 2.0 Quick Start](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/quick_start.htm)
- [Bulk API Best Practices](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/best_practices.htm)
- [Trailhead: Big Object Basics](https://trailhead.salesforce.com/en/content/learn/modules/big_objects)