# Salesforce Bulk API Guide
## Overview
The Salesforce Bulk API is designed to process data sets ranging from a few
thousand to millions of records. It is optimized for loading, updating, or
deleting large volumes of data asynchronously, making it the right choice for
large-scale data operations in Salesforce.
## Key Concepts
### Core Principles
- **Asynchronous Processing**: Operations run in the background, allowing for
large-scale data processing
- **Batch Processing**: Data is processed in batches for optimal performance and
resource usage
- **Parallel Processing**: Multiple batches can be processed simultaneously for
faster completion
### API Versions
1. **Bulk API 1.0**
   - CSV, XML, or JSON request data
   - Custom REST endpoints (`/services/async/`) with SOAP-style session authentication
   - Maximum 10,000 records (or 10 MB) per batch
   - Batches must be created and managed explicitly
2. **Bulk API 2.0**
   - CSV format only
   - Built on the standard REST API framework
   - Simpler implementation
   - No batch management required (batching is handled internally)
   - Maximum data size: 150 MB (base64-encoded) per job
## Implementation Guide
### Bulk API 2.0 Implementation
#### 1. Create a Job
```python
import requests
import json
# Authentication headers (access_token and instance_url come from your OAuth
# flow; see the sketch after this block)
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
# Job creation
job_data = {
"object": "Account",
"operation": "insert",
"lineEnding": "LF"
}
response = requests.post(
f'{instance_url}/services/data/v58.0/jobs/ingest',
headers=headers,
data=json.dumps(job_data)
)
job_id = response.json()['id']
```
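The snippet above assumes `access_token` and `instance_url` already exist. One way to obtain them is the OAuth 2.0 client credentials flow; the sketch below assumes a connected app configured for that flow and uses placeholder credential names and a placeholder My Domain URL.

```python
import requests

# Placeholder connected-app credentials and My Domain URL; any OAuth 2.0 flow
# that yields an access token and instance URL works equally well here.
token_response = requests.post(
    'https://your-domain.my.salesforce.com/services/oauth2/token',
    data={
        'grant_type': 'client_credentials',
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET
    }
)
token_response.raise_for_status()
auth = token_response.json()
access_token = auth['access_token']
instance_url = auth['instance_url']
```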
#### 2. Upload Data
```python
# CSV data
csv_data = """Name,BillingCity,NumberOfEmployees
Acme Corp,San Francisco,1000
Global Tech,New York,5000
"""
headers['Content-Type'] = 'text/csv'
upload_response = requests.put(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/batches',
headers=headers,
data=csv_data
)
```
#### 3. Close Job and Monitor
```python
# Close the job
close_data = {"state": "UploadComplete"}
headers['Content-Type'] = 'application/json'
close_response = requests.patch(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
headers=headers,
data=json.dumps(close_data)
)
# Check job status
status_response = requests.get(
f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
headers=headers
)
```
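Once the job reaches `JobComplete`, per-record results can be downloaded from the results endpoints. A minimal sketch for the successful results (the CSV prepends `sf__Id` and `sf__Created` columns to the original fields):

```python
# Download successful results after the job reports JobComplete
success_response = requests.get(
    f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/successfulResults',
    headers={'Authorization': f'Bearer {access_token}', 'Accept': 'text/csv'}
)
with open('successful_records.csv', 'w') as f:
    f.write(success_response.text)
```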
### Bulk API 1.0 Implementation (Legacy)
```java
// Java example using the Salesforce Web Service Connector (WSC) library.
// config is a ConnectorConfig carrying the session ID and async API endpoint.
BulkConnection bulkConnection = new BulkConnection(config);

JobInfo job = new JobInfo();
job.setObject("Account");
job.setOperation(OperationEnum.insert);
job.setContentType(ContentType.CSV);
job = bulkConnection.createJob(job);

// Build batches: each batch is a CSV (header + up to 10,000 rows) streamed to the job.
// Note: fields with embedded commas or quotes would need re-escaping when rejoined.
CSVReader csvReader = new CSVReader(inputStream);
String header = String.join(",", csvReader.nextRecord()) + "\n";
List<BatchInfo> batchInfoList = new ArrayList<>();
StringBuilder batchCsv = new StringBuilder(header);
int rows = 0;
List<String> fields;
while ((fields = csvReader.nextRecord()) != null) {
    batchCsv.append(String.join(",", fields)).append("\n");
    if (++rows == 10000) {  // flush a full batch
        batchInfoList.add(bulkConnection.createBatchFromStream(job,
            new ByteArrayInputStream(batchCsv.toString().getBytes(StandardCharsets.UTF_8))));
        batchCsv = new StringBuilder(header);
        rows = 0;
    }
}
if (rows > 0) {  // flush the final partial batch
    batchInfoList.add(bulkConnection.createBatchFromStream(job,
        new ByteArrayInputStream(batchCsv.toString().getBytes(StandardCharsets.UTF_8))));
}
```
## Best Practices
### 1. **Optimal Batch Sizing**
- Bulk API 2.0: Keep each upload under roughly 100 MB of raw CSV for best performance (the 150 MB job limit applies to base64-encoded data); see the chunking sketch below
- Bulk API 1.0: 5,000-10,000 records per batch
- Consider record complexity (field count, triggers, lookups) when sizing batches
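One way to respect the size guidance above is to split rows into multiple uploads by accumulated payload size. The sketch below is a rough approach; the 100 MB threshold is a conservative default, not an API requirement.

```python
def split_csv_rows(header, rows, max_bytes=100 * 1024 * 1024):
    """Yield CSV payloads (header + rows) that each stay under max_bytes."""
    chunk, size = [header], len(header.encode('utf-8')) + 1
    for row in rows:
        row_bytes = len(row.encode('utf-8')) + 1  # +1 for the trailing newline
        if size + row_bytes > max_bytes and len(chunk) > 1:
            yield '\n'.join(chunk) + '\n'
            chunk, size = [header], len(header.encode('utf-8')) + 1
        chunk.append(row)
        size += row_bytes
    if len(chunk) > 1:
        yield '\n'.join(chunk) + '\n'
```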
### 2. **Error Handling**
```python
def process_job_results(job_id):
    # Failed records come back as CSV with sf__Id and sf__Error columns
    failed_results = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/failedResults',
        headers=headers
    )
    if failed_results.text:
        with open('failed_records.csv', 'w') as f:
            f.write(failed_results.text)

    # Records that were never processed (e.g. the job was aborted or failed early)
    unprocessed_results = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/unprocessedrecords',
        headers=headers
    )
    if unprocessed_results.text:
        with open('unprocessed_records.csv', 'w') as f:
            f.write(unprocessed_results.text)
```
### 3. **Performance Optimization**
- Use external ID fields for upsert operations
- Disable triggers, workflows, and validation rules during bulk loads when
appropriate
- Consider using parallel processing for multiple objects
- Schedule jobs during off-peak hours
### 4. **Data Preparation**
```python
# Clean and prepare a single field value before it is written into the CSV payload
def prepare_csv_field(value):
    if value is None or value == 'NULL':
        return ''  # blank = not set; use '#N/A' to explicitly null a field
    value = str(value).replace('\r\n', '\n')  # normalize line endings (the job uses LF)
    if any(ch in value for ch in (',', '"', '\n')):
        value = '"' + value.replace('"', '""') + '"'  # quote and escape per CSV rules
    return value
```
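Alternatively, Python's built-in `csv` module handles quoting and escaping automatically. A minimal sketch, assuming records are dictionaries keyed by field API name:

```python
import csv
import io

def build_csv_payload(records, field_names):
    """Serialize dict records into a Bulk API-ready CSV string (LF line endings)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator='\n')
    writer.writeheader()
    for record in records:
        # Missing keys become empty fields; use '#N/A' to explicitly null a field
        writer.writerow({name: record.get(name, '') for name in field_names})
    return buffer.getvalue()
```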
## Common Operations
### Insert Operation
```python
job_config = {
"object": "Contact",
"operation": "insert"
}
```
### Update Operation
```python
job_config = {
"object": "Account",
"operation": "update"
}
# Requires Id field in CSV
```
### Upsert Operation
```python
job_config = {
"object": "Product2",
"operation": "upsert",
"externalIdFieldName": "Product_Code__c"
}
```
### Delete Operation
```python
job_config = {
"object": "Opportunity",
"operation": "delete"
}
# Only requires Id field
```
### Hard Delete Operation
```python
job_config = {
"object": "Lead",
"operation": "hardDelete"
}
# Requires the "Bulk API Hard Delete" user permission
```
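Each of these `job_config` variants plugs into the same job-creation request shown in the implementation guide. A small helper keeps the flow uniform (the `create_ingest_job` name is illustrative):

```python
def create_ingest_job(job_config):
    """Create a Bulk API 2.0 ingest job from one of the configs above."""
    response = requests.post(
        f'{instance_url}/services/data/v58.0/jobs/ingest',
        headers={
            'Authorization': f'Bearer {access_token}',
            'Content-Type': 'application/json'
        },
        json={**job_config, 'lineEnding': 'LF'}
    )
    response.raise_for_status()
    return response.json()['id']

# Usage
job_id = create_ingest_job({"object": "Contact", "operation": "insert"})
```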
## Monitoring and Management
### Job Status States
- **Open**: Job created, ready for batches
- **UploadComplete**: Data upload finished
- **InProgress**: Job is processing
- **Aborted**: Job was aborted
- **Failed**: Job failed
- **JobComplete**: All processing complete
### Query Job Status
```python
import time

def monitor_job(job_id):
    while True:
        response = requests.get(
            f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}',
            headers=headers
        )
        job_info = response.json()
        state = job_info['state']
        print(f"Job State: {state}")
        print(f"Records Processed: {job_info['numberRecordsProcessed']}")
        print(f"Records Failed: {job_info['numberRecordsFailed']}")
        if state in ['JobComplete', 'Failed', 'Aborted']:
            return job_info
        time.sleep(10)  # Poll every 10 seconds
```
## Limits and Considerations
### API Limits
| Limit Type                  | Bulk API 1.0                         | Bulk API 2.0                                      |
| --------------------------- | ------------------------------------ | ------------------------------------------------- |
| Max data size               | 10 MB per batch                      | 150 MB (base64-encoded) per job                   |
| Max records per batch       | 10,000                               | N/A (batching handled internally)                 |
| Batch allocation            | 15,000 per rolling 24 hours (shared) | Internal batches count toward the same allocation |
| Max characters per field    | 32,000                               | 32,000                                            |
| Max jobs open concurrently  | 100                                  | 100                                               |
| Max time before job timeout | 7 days                               | 7 days                                            |
### Org Allocation Limits
- Daily Bulk API calls: based on org edition (see the allocation check sketch below)
- Concurrent Bulk API jobs: 100
- Total size of Bulk API batches queued: 250 MB
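Current consumption against these allocations can be checked with the REST `limits` resource. The sketch below is assumption-heavy: the exact key names reported vary by org and API version, so inspect the response for your org.

```python
def check_bulk_allocations():
    response = requests.get(
        f'{instance_url}/services/data/v58.0/limits',
        headers={'Authorization': f'Bearer {access_token}'}
    )
    limits = response.json()
    # Key names are assumptions; print everything Bulk-related that the org reports
    for key, value in limits.items():
        if 'Bulk' in key:
            print(f"{key}: {value['Remaining']} of {value['Max']} remaining")
```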
## Error Handling Patterns
### Retry Logic
```python
def bulk_operation_with_retry(data, max_retries=3):
    for attempt in range(max_retries):
        try:
            job_id = create_bulk_job()
            upload_data(job_id, data)
            close_job(job_id)
            results = wait_for_completion(job_id)
            if results['numberRecordsFailed'] == 0:
                return results
            # Retry with only the failed records (strip the sf__ result columns
            # first; see the sketch below)
            data = get_failed_records(job_id)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    raise RuntimeError('Bulk job still has failed records after all retries')
```
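The `failedResults` CSV prepends `sf__Id` and `sf__Error` columns to the original record fields, so those columns should be stripped before re-uploading. A sketch of `get_failed_records` along those lines (the other helpers above remain hypothetical):

```python
import csv
import io

def get_failed_records(job_id):
    """Download failed results and return a CSV payload ready to re-upload."""
    response = requests.get(
        f'{instance_url}/services/data/v58.0/jobs/ingest/{job_id}/failedResults',
        headers={'Authorization': f'Bearer {access_token}'}
    )
    reader = csv.DictReader(io.StringIO(response.text))
    # Drop the sf__ result columns added by Salesforce; keep the original fields
    field_names = [name for name in reader.fieldnames if not name.startswith('sf__')]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in field_names})
    return buffer.getvalue()
```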
### Common Errors
1. **INVALID_FIELD**: Field doesn't exist or isn't accessible
2. **REQUIRED_FIELD_MISSING**: Required field not provided
3. **DUPLICATE_VALUE**: Unique constraint violation
4. **INVALID_CROSS_REFERENCE_KEY**: Invalid lookup relationship
5. **TOO_MANY_REQUESTS**: API limit exceeded
## Security Best Practices
1. **Authentication**
- Use OAuth 2.0 for authentication
- Rotate tokens regularly
- Store credentials securely
2. **Data Security**
- Encrypt data in transit
- Validate data before upload
- Implement field-level security
3. **Access Control**
- Use integration user with minimal permissions
- Implement IP restrictions
- Enable login forensics
## Performance Tuning
### Optimization Techniques
1. **Parallel Processing**
```python
from concurrent.futures import ThreadPoolExecutor
def process_multiple_objects(object_data_map):
with ThreadPoolExecutor(max_workers=5) as executor:
futures = []
for obj_name, data in object_data_map.items():
future = executor.submit(bulk_load_object, obj_name, data)
futures.append(future)
# Wait for all to complete
for future in futures:
result = future.result()
```
2. **Data Preparation**
- Pre-sort data by parent records
- Group related records
- Remove unnecessary columns
3. **Resource Management**
- Monitor API usage
- Implement backpressure (see the semaphore sketch below)
- Use connection pooling
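One simple form of backpressure is capping the number of Bulk jobs in flight with a semaphore. A minimal sketch, reusing the hypothetical `bulk_load_object` helper from the parallel-processing example:

```python
import threading

# Cap the number of Bulk API jobs in flight, regardless of which thread submits them
MAX_CONCURRENT_JOBS = 5
job_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

def throttled_bulk_load(obj_name, data):
    with job_slots:  # blocks callers once MAX_CONCURRENT_JOBS loads are running
        return bulk_load_object(obj_name, data)
```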
## Integration Patterns
### ETL Pipeline Integration
```python
class SalesforceBulkETL:
def __init__(self, config):
self.config = config
self.sf_client = self.connect()
def extract_transform_load(self, source_data):
# Transform data
transformed_data = self.transform(source_data)
# Create Bulk API job
job_id = self.create_job('Account', 'upsert', 'External_Id__c')
# Upload in chunks
for chunk in self.chunk_data(transformed_data, 50000):
self.upload_chunk(job_id, chunk)
# Close and monitor
self.close_job(job_id)
return self.monitor_job(job_id)
```
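The `chunk_data` helper referenced above is left undefined. A minimal sketch, shown as a standalone function for clarity, assuming `transformed_data` is a list of dict records and reusing the `build_csv_payload` builder from the data-preparation section:

```python
def chunk_data(records, chunk_size=50000):
    """Yield CSV payloads of at most chunk_size records each."""
    field_names = list(records[0].keys())
    for start in range(0, len(records), chunk_size):
        yield build_csv_payload(records[start:start + chunk_size], field_names)
```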
## Troubleshooting
### Common Issues and Solutions
1. **Job Timeout**
- Break large datasets into smaller jobs
- Process during off-peak hours
- Use query-based processing for very large datasets
2. **Memory Issues**
- Stream data instead of loading into memory
- Process in smaller chunks
- Use generators for data transformation
3. **Lock Contention**
- Avoid updating same records concurrently
- Use external IDs for relationships
- Consider record locking strategy
## Additional Resources
- [Salesforce Bulk API Developer Guide](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/)
- [Bulk API 2.0 Quick Start](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/quick_start.htm)
- [Bulk API Best Practices](https://developer.salesforce.com/docs/atlas.en-us.api_bulk_v2.meta/api_bulk_v2/best_practices.htm)
- [Trailhead: Big Object Basics](https://trailhead.salesforce.com/en/content/learn/modules/big_objects)