# Data Extraction Task
This task guides the systematic extraction of data from various sources for
Salesforce migration, integration, and analysis purposes.
## Purpose
Enable ETL developers to:
- Design efficient data extraction processes
- Ensure data integrity and completeness
- Optimize extraction performance
- Handle complex data relationships
- Maintain security and compliance
## Prerequisites
- Access to source systems and databases
- Understanding of source data schemas
- Salesforce target data model knowledge
- Appropriate security credentials and permissions
- Data extraction tools and infrastructure
## Data Extraction Framework
### 1. Source System Analysis
**Data Source Assessment**
```yaml
Source_Systems:
  Salesforce_Org:
    Type: Salesforce Production/Sandbox
    Access: REST API, SOAP API, Bulk API
    Limits: API call limits, data storage
    Format: JSON, XML, CSV
  Database_Systems:
    Type: SQL Server, Oracle, MySQL, PostgreSQL
    Access: Direct connection, ODBC, JDBC
    Constraints: Connection limits, query timeouts
    Format: Tables, views, stored procedures
  File_Systems:
    Type: CSV, JSON, XML, Excel
    Access: FTP, SFTP, cloud storage
    Structure: Flat files, hierarchical
    Format: Delimited, fixed-width
  Web_Services:
    Type: REST APIs, SOAP services
    Authentication: OAuth, API keys, tokens
    Rate_Limits: Requests per minute/hour
    Format: JSON, XML responses
```
**Data Profiling and Discovery**

The following profiling algorithms characterize each source before extraction begins.
### Algorithm: Record Distribution Analysis
```
INPUT: objectName (e.g., "Account")
PROCESS:
1. INITIALIZE counters:
- totalRecords = 0
- uniqueValues = empty set
- earliestDate = null
- latestDate = null
2. FOR each record in object:
- INCREMENT totalRecords
- ADD field values to uniqueValues sets
- UPDATE earliestDate if record.createdDate < earliestDate
- UPDATE latestDate if record.createdDate > latestDate
3. CALCULATE statistics:
- uniqueCount = size of uniqueValues set
- dateRange = latestDate - earliestDate
4. RETURN profiling results:
- Object name
- Total record count
- Unique value counts per field
- Date range of records
```
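As a minimal Python sketch of this profiling pass, assuming the `simple_salesforce` connection from the Implementation Steps, a caller-supplied field list, and modest record volumes (larger objects would go through the Bulk API covered later); `profile_object` is an illustrative helper, not a library function:

```python
from collections import defaultdict
from simple_salesforce import Salesforce

def profile_object(sf_connection: Salesforce, object_name: str, fields: list) -> dict:
    """Single-pass record distribution profile for one object."""
    field_list = list(dict.fromkeys(fields + ["CreatedDate"]))  # avoid duplicate fields in SOQL
    soql = f"SELECT {', '.join(field_list)} FROM {object_name}"

    unique_values = defaultdict(set)
    total, earliest, latest = 0, None, None
    for record in sf_connection.query_all(soql)["records"]:
        total += 1
        for field in fields:
            unique_values[field].add(record.get(field))
        created = record["CreatedDate"]  # ISO 8601 strings compare chronologically
        earliest = created if earliest is None or created < earliest else earliest
        latest = created if latest is None or created > latest else latest

    return {
        "object": object_name,
        "total_records": total,
        "unique_counts": {f: len(v) for f, v in unique_values.items()},
        "date_range": (earliest, latest),
    }
```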
### Algorithm: Data Quality Assessment
```
INPUT: objectName, fieldName, qualityRule
PROCESS:
1. INITIALIZE:
- issueCount = 0
- totalCount = 0
2. FOR each record in object:
- INCREMENT totalCount
- IF qualityRule(record.fieldName) fails THEN
INCREMENT issueCount
3. CALCULATE percentage = (issueCount / totalCount) * 100
4. RETURN quality metrics:
- Issue description
- Absolute count
- Percentage of total
```
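A short Python equivalent, assuming records have already been extracted into a list of dictionaries and the rule is supplied as a callable (`assess_field_quality` is illustrative only):

```python
from typing import Callable

def assess_field_quality(records: list, field_name: str,
                         quality_rule: Callable, issue_description: str) -> dict:
    """Count records whose field value fails the supplied quality rule."""
    total = len(records)
    issues = sum(1 for record in records if not quality_rule(record.get(field_name)))
    return {
        "issue": issue_description,
        "issue_count": issues,
        "total_count": total,
        "issue_percentage": (issues / total * 100) if total else 0.0,
    }

# Example: flag Accounts with a missing Industry value
# report = assess_field_quality(accounts, "Industry", lambda value: bool(value), "Missing Industry")
```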
### Algorithm: Relationship Complexity Analysis
```
INPUT: childObject, parentRelationship
PROCESS:
1. CREATE relationship map:
- parentCounts = dictionary
- relationshipGroups = dictionary
2. FOR each child record:
- GET parentId from relationship
- INCREMENT parentCounts[parentId]
- ADD record to relationshipGroups[relationshipName]
3. CALCULATE metrics:
- totalRelationships = sum of all counts
- uniqueParents = count of keys in parentCounts
- averageChildrenPerParent = totalRelationships / uniqueParents
4. RETURN complexity analysis:
- Relationship name
- Total record count
- Unique parent count
- Distribution statistics
```
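The same analysis in Python, assuming the child records are dictionaries carrying a parent lookup field (the default field name is a placeholder):

```python
from collections import Counter

def analyze_relationship(child_records: list, parent_field: str = "AccountId") -> dict:
    """Summarize how child records are distributed across their parents."""
    parent_counts = Counter(r[parent_field] for r in child_records if r.get(parent_field))
    total = sum(parent_counts.values())
    unique_parents = len(parent_counts)
    return {
        "relationship_field": parent_field,
        "total_relationships": total,
        "unique_parents": unique_parents,
        "avg_children_per_parent": total / unique_parents if unique_parents else 0,
        "max_children_for_one_parent": max(parent_counts.values(), default=0),
    }
```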
### 2. Extraction Strategy Design
**Extraction Patterns**
```yaml
Full_Extract:
  Use_Cases:
    - Initial data migration
    - Complete system refresh
    - Data archival
  Considerations:
    - Large data volumes
    - Extended processing time
    - System impact
Incremental_Extract:
  Use_Cases:
    - Regular synchronization
    - Change data capture
    - Real-time updates
  Methods:
    - Timestamp-based
    - Sequence-based
    - Change log analysis
Delta_Extract:
  Use_Cases:
    - Modified records only
    - Efficient updates
    - Minimal system impact
  Tracking:
    - LastModifiedDate
    - SystemModstamp
    - Custom change flags
```
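To illustrate the incremental and delta patterns, a watermark-based SOQL builder might look like the sketch below. `SystemModstamp` and `LastModifiedDate` are standard Salesforce audit fields; where and how the watermark is persisted is left open here.

```python
from datetime import datetime, timezone
from typing import List, Optional

def build_incremental_soql(object_name: str, fields: List[str],
                           last_run: Optional[datetime]) -> str:
    """Build a SOQL query that only pulls records changed since the last run."""
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"
    if last_run is not None:
        # SystemModstamp also captures system-driven changes, unlike LastModifiedDate
        watermark = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        soql += f" WHERE SystemModstamp > {watermark}"
    return soql + " ORDER BY SystemModstamp"
```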
**Performance Optimization Strategy**
```json
{
  "extraction_optimization": {
    "bulk_operations": {
      "salesforce_bulk_api": "2.0",
      "batch_size": 10000,
      "parallel_processing": true,
      "compression": "gzip"
    },
    "query_optimization": {
      "selective_queries": true,
      "indexed_fields": "use in WHERE clauses",
      "limit_results": "paginate large datasets",
      "avoid_wildcards": "specify field lists"
    },
    "resource_management": {
      "connection_pooling": true,
      "memory_management": "stream processing",
      "error_handling": "retry logic",
      "logging": "detailed audit trail"
    }
  }
}
```
## Implementation Steps
### Step 1: Environment Setup and Configuration
**Connection Configuration**
```python
# Salesforce connection example (username/password + security token login)
from simple_salesforce import Salesforce

sf_connection = Salesforce(
    username='user@company.com',
    password='password',
    security_token='token',
    domain='test'  # 'test' targets a sandbox; omit for production
)

# Bulk API access for large data sets via the built-in handler,
# e.g. sf_connection.bulk.Account.query("SELECT Id, Name FROM Account")
bulk_handler = sf_connection.bulk
```
**Database Connection Setup**
```python
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# SQL Server connection via ODBC
connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=server_name;"
    "DATABASE=database_name;"
    "UID=username;"
    "PWD=password"
)
conn = pyodbc.connect(connection_string)

# PostgreSQL connection with SQLAlchemy
engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# Quick sanity check: pull a sample of rows into a DataFrame (table name is a placeholder)
sample = pd.read_sql("SELECT * FROM source_table LIMIT 100", engine)
```
### Step 2: Data Extraction Implementation
**Salesforce Data Extraction**
```
Algorithm: Salesforce Account Data Extraction
INPUT: connection_object, batch_size (default: 10000)
PROCESS:
1. DEFINE field_list = [Id, Name, Type, Industry, BillingStreet, BillingCity,
BillingState, BillingPostalCode, BillingCountry, Phone,
Website, AnnualRevenue, NumberOfEmployees, CreatedDate,
LastModifiedDate]
2. INCLUDE related_data = [Contact records where IsDeleted = FALSE]
3. BUILD query with field_list and related_data
4. ESTIMATE total_record_count for Account object
5. IF total_record_count < 50000 THEN
use standard SOQL query execution
6. ELSE
use bulk_api_extraction method
7. RETURN extracted_data_set
OUTPUT: account_records_with_related_contacts
```
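A condensed Python version of this routing logic, assuming the `sf_connection` from Step 1, a simplified field list, and the 50,000-record threshold from the pseudocode (the related-contact subquery is omitted for brevity):

```python
ACCOUNT_FIELDS = ["Id", "Name", "Type", "Industry", "Phone", "Website",
                  "AnnualRevenue", "NumberOfEmployees", "CreatedDate", "LastModifiedDate"]

def extract_accounts(sf_connection, threshold: int = 50_000) -> list:
    """Use standard SOQL for small volumes and the Bulk API for large ones."""
    total = sf_connection.query("SELECT COUNT() FROM Account")["totalSize"]
    soql = f"SELECT {', '.join(ACCOUNT_FIELDS)} FROM Account"

    if total < threshold:
        return sf_connection.query_all(soql)["records"]
    # Bulk API query through simple_salesforce's built-in handler
    return sf_connection.bulk.Account.query(soql)
```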
```
Algorithm: Bulk API Data Extraction
INPUT: object_name, query, batch_size
PROCESS:
1. CREATE bulk_extraction_job for object_name
2. SUBMIT query to bulk_api
3. INITIALIZE results = empty_collection
4. WHILE job has remaining batches:
a. RETRIEVE next batch from job
b. GET batch_results from batch
c. APPEND batch_results to results
5. RETURN consolidated results
OUTPUT: complete_extracted_dataset
```
**Database Extraction with Change Detection**
```
Algorithm: Incremental Database Extraction
INPUT: table_name, timestamp_field, last_extraction_time (optional)
PROCESS:
1. BUILD base_query = "SELECT * FROM " + table_name
2. IF last_extraction_time is provided THEN
a. ADD WHERE clause: timestamp_field > last_extraction_time
b. ADD ORDER BY timestamp_field
3. ELSE
ADD ORDER BY timestamp_field only
4. EXECUTE query against database connection
5. RETURN query_results as structured dataset
OUTPUT: extracted_records_since_last_extraction
```
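A minimal sketch with pandas and the SQLAlchemy `engine` from Step 1; the table and timestamp column names are placeholders:

```python
import pandas as pd
from sqlalchemy import text

def extract_incremental(engine, table_name: str, timestamp_field: str,
                        last_extraction_time=None) -> pd.DataFrame:
    """Pull only rows changed since the last extraction (or everything on the first run)."""
    query = f"SELECT * FROM {table_name}"
    params = {}
    if last_extraction_time is not None:
        query += f" WHERE {timestamp_field} > :last_run"
        params["last_run"] = last_extraction_time
    query += f" ORDER BY {timestamp_field}"
    return pd.read_sql(text(query), engine, params=params)
```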
```
Algorithm: Database Extraction with Integrity Verification
INPUT: table_name, key_field
PROCESS:
1. BUILD query with all table fields
2. ADD checksum calculation for row integrity
3. ADD ORDER BY key_field for consistent results
4. EXECUTE query against database connection
5. CALCULATE row_checksum for each record
6. RETURN dataset with original_data and checksum_verification
OUTPUT: records_with_integrity_checksums
```
### Step 3: Data Relationship Handling
**Hierarchical Data Extraction**
```
Algorithm: Account Hierarchy Extraction
INPUT: salesforce_connection
PROCESS:
1. DEFINE field_list = [Id, Name, ParentId, Type, Industry, CreatedDate, LastModifiedDate]
2. BUILD query to extract all non-deleted accounts
3. ORDER results by ParentId (nulls first), then by Name
4. EXECUTE query and retrieve all account records
5. INITIALIZE hierarchy structure:
- root_accounts = empty list
- child_accounts = empty dictionary
- orphaned_accounts = empty list
6. FOR each account in extracted records:
a. IF account.ParentId is null THEN
ADD account to root_accounts
b. ELSE
IF parent_id not in child_accounts THEN
CREATE new list for parent_id
ADD account to child_accounts[parent_id]
7. RETURN structured hierarchy
OUTPUT: hierarchical_account_structure
```
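The grouping step translates directly to Python; this sketch assumes `accounts` is the list of extracted Account dictionaries:

```python
from collections import defaultdict

def build_account_hierarchy(accounts: list) -> dict:
    """Split accounts into roots, a parent-id index of children, and orphans."""
    root_accounts = []
    child_accounts = defaultdict(list)

    for account in accounts:
        parent_id = account.get("ParentId")
        if parent_id is None:
            root_accounts.append(account)
        else:
            child_accounts[parent_id].append(account)

    # Orphans: children whose parent was not included in the extract
    known_ids = {a["Id"] for a in accounts}
    orphaned = [a for a in accounts
                if a.get("ParentId") and a["ParentId"] not in known_ids]
    return {"roots": root_accounts, "children": dict(child_accounts), "orphaned": orphaned}
```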
**Related Object Extraction**
```
Algorithm: Opportunity Ecosystem Extraction
INPUT: account_ids_list
PROCESS:
1. INITIALIZE opportunities = empty collection
2. FOR each account_id in account_ids_list:
a. DEFINE opportunity_fields = [Id, Name, StageName, Amount, CloseDate, AccountId]
b. DEFINE related_line_items = [Id, PricebookEntry.Product2.Name, Quantity,
UnitPrice, TotalPrice]
c. DEFINE related_tasks = [Id, Subject, ActivityDate, Status, WhoId
WHERE IsClosed = FALSE]
d. DEFINE related_events = [Id, Subject, ActivityDateTime, WhoId
WHERE ActivityDateTime >= TODAY]
e. BUILD query including opportunity_fields and all related objects
f. ADD filter: AccountId = current account_id AND IsDeleted = FALSE
g. EXECUTE query for current account
h. APPEND results to opportunities collection
3. RETURN complete opportunities dataset
OUTPUT: opportunities_with_related_objects
```
## Advanced Extraction Techniques
### Real-time Data Extraction
```
Algorithm: Streaming API Setup for Real-time Updates
INPUT: object_name, fields_list
PROCESS:
1. CREATE push_topic_configuration:
- Name = object_name + "_Updates"
- Query = "SELECT " + join(fields_list) + " FROM " + object_name
- ApiVersion = current_api_version
- NotifyForOperationCreate = true
- NotifyForOperationUpdate = true
- NotifyForOperationDelete = true
- NotifyForFields = "All"
2. SUBMIT push_topic_configuration to Salesforce
3. RECEIVE push_topic_id from creation response
4. RETURN push_topic_id for subscription
OUTPUT: streaming_topic_identifier
```
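With `simple_salesforce`, the PushTopic record can be created through the generic sObject interface, roughly as sketched below. Note that PushTopics belong to the legacy Streaming API (Change Data Capture is the newer alternative), the SOQL must include `Id`, and the exact field types accepted may vary by API version.

```python
def create_push_topic(sf_connection, object_name: str, fields: list,
                      api_version: float = 58.0) -> str:
    """Create a PushTopic that fires on create/update/delete of the given object."""
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"  # fields must include Id
    result = sf_connection.PushTopic.create({
        "Name": f"{object_name}_Updates",  # PushTopic names are limited to 25 characters
        "Query": soql,
        "ApiVersion": api_version,
        "NotifyForOperationCreate": True,
        "NotifyForOperationUpdate": True,
        "NotifyForOperationDelete": True,
        "NotifyForFields": "All",
    })
    return result["id"]
```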
```
Algorithm: Real-time Change Listener
INPUT: push_topic_name, callback_function
PROCESS:
1. INITIALIZE streaming_client with:
- session_id from salesforce_connection
- instance_url from salesforce_connection
2. SUBSCRIBE to streaming topic: "/topic/" + push_topic_name
3. REGISTER callback_function for change notifications
4. START streaming client listener
5. CONTINUOUSLY process incoming change events
6. FOR each change event:
EXECUTE callback_function with event_data
OUTPUT: continuous_real_time_monitoring
```
### Large Volume Data Handling
```
Algorithm: Large Dataset Extraction with Pagination
INPUT: object_name, query, chunk_size (default: 50000)
PROCESS:
1. INITIALIZE all_records = empty collection
2. MODIFY query to include LIMIT chunk_size
3. EXECUTE initial query
4. ADD initial results to all_records
5. WHILE query result indicates more records available:
a. GET next_records_url from previous result
b. EXECUTE query_more with next_records_url
c. ADD new results to all_records
d. IF all_records count is multiple of (chunk_size * 5) THEN
save intermediate results for recovery
6. RETURN complete all_records collection
OUTPUT: complete_large_dataset
```
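In `simple_salesforce` this pagination loop maps onto `query` / `query_more`, roughly as follows; intermediate checkpointing (see Error Handling and Recovery) would slot into the loop body:

```python
def extract_with_pagination(sf_connection, soql: str) -> list:
    """Page through a large SOQL result set using nextRecordsUrl."""
    result = sf_connection.query(soql)
    all_records = list(result["records"])

    while not result["done"]:
        result = sf_connection.query_more(result["nextRecordsUrl"], identifier_is_url=True)
        all_records.extend(result["records"])
        # Optionally persist intermediate results here for crash recovery
    return all_records
```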
```
Algorithm: Parallel Multi-Query Extraction
INPUT: queries_dictionary, max_workers (default: 5)
PROCESS:
1. INITIALIZE results = empty dictionary
2. CREATE thread_pool with max_workers threads
3. FOR each query_name and query in queries_dictionary:
SUBMIT query execution to thread_pool
4. COLLECT completed futures as they finish:
a. GET query_name for completed future
b. TRY to get future result
c. IF successful THEN
SET results[query_name] = query_result
d. ELSE
SET results[query_name] = error_information
5. RETURN results dictionary with all query outcomes
OUTPUT: parallel_query_results
```
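A compact sketch with `concurrent.futures`, assuming each dictionary entry is a plain SOQL string run against a shared `sf_connection` (read-only queries are generally safe to issue concurrently, but org API limits still apply):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_parallel(sf_connection, queries: dict, max_workers: int = 5) -> dict:
    """Run several SOQL queries concurrently, collecting results or errors per query."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sf_connection.query_all, soql): name
                   for name, soql in queries.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()["records"]
            except Exception as exc:  # record the failure but keep processing other queries
                results[name] = {"error": str(exc)}
    return results
```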
## Data Quality and Validation
### Extraction Validation Framework
```
Algorithm: Record Count Validation
INPUT: source_count, extracted_count, tolerance (default: 0.01)
PROCESS:
1. CALCULATE variance = |source_count - extracted_count| / source_count
2. CALCULATE variance_percentage = variance * 100
3. DETERMINE within_tolerance = (variance <= tolerance)
4. IF within_tolerance THEN
SET status = "PASS"
5. ELSE
SET status = "FAIL"
6. CREATE validation_result with all metrics
7. RETURN validation_result
OUTPUT: count_validation_report
```
```
Algorithm: Data Integrity Validation
INPUT: extracted_data, key_field
PROCESS:
1. CONVERT extracted_data to structured format
2. COUNT total_records in dataset
3. COUNT unique_keys in key_field
4. CALCULATE duplicate_count = total_records - unique_keys
5. COUNT null_keys in key_field
6. VERIFY data_types_consistent across records
7. COMPILE integrity_metrics
8. RETURN integrity_validation_report
OUTPUT: data_integrity_assessment
```
```
Algorithm: Referential Integrity Validation
INPUT: parent_data, child_data, parent_key, foreign_key
PROCESS:
1. EXTRACT parent_ids from parent_data using parent_key
2. EXTRACT foreign_keys from child_data using foreign_key (exclude nulls)
3. IDENTIFY orphaned_records = foreign_keys NOT IN parent_ids
4. COUNT parent_record_count, child_record_count, orphaned_child_count
5. IF child_record_count > 0 THEN
CALCULATE integrity_percentage = (1 - orphaned_child_count / child_record_count) * 100
6. ELSE
SET integrity_percentage = 100
7. COMPILE referential_integrity_report
8. RETURN validation_results
OUTPUT: referential_integrity_assessment
```
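A set-based sketch of the referential check, assuming both datasets are lists of dictionaries; the default key names are placeholders that would differ per object pair:

```python
def validate_referential_integrity(parent_data: list, child_data: list,
                                   parent_key: str = "Id",
                                   foreign_key: str = "AccountId") -> dict:
    """Report child records whose foreign key has no matching parent record."""
    parent_ids = {p[parent_key] for p in parent_data}
    foreign_keys = [c[foreign_key] for c in child_data if c.get(foreign_key)]
    orphaned = [fk for fk in foreign_keys if fk not in parent_ids]

    child_count = len(child_data)
    integrity_pct = 100.0 if child_count == 0 else (1 - len(orphaned) / child_count) * 100
    return {
        "parent_records": len(parent_data),
        "child_records": child_count,
        "orphaned_children": len(orphaned),
        "integrity_percentage": round(integrity_pct, 2),
    }
```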
## Error Handling and Recovery
### Robust Extraction Pipeline
```
Algorithm: Extraction with Retry Logic
INPUT: extraction_function, function_parameters, retry_attempts (default: 3), backoff_factor (default: 2)
PROCESS:
1. FOR attempt = 1 to retry_attempts:
a. TRY to execute extraction_function with function_parameters
b. IF successful THEN
RETURN extraction_result
c. IF exception occurs THEN
IF attempt = retry_attempts THEN
RAISE final exception
ELSE
CALCULATE wait_time = backoff_factor ^ (attempt - 1)
LOG failure message and retry information
WAIT for wait_time seconds
2. IF all attempts fail THEN
RAISE extraction_failure_exception
OUTPUT: successful_extraction_result OR exception
```
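A minimal retry wrapper implementing the same exponential backoff scheme:

```python
import logging
import time

def extract_with_retry(extraction_function, *args, retry_attempts: int = 3,
                       backoff_factor: float = 2.0, **kwargs):
    """Retry a flaky extraction call with exponential backoff between attempts."""
    for attempt in range(1, retry_attempts + 1):
        try:
            return extraction_function(*args, **kwargs)
        except Exception as exc:
            if attempt == retry_attempts:
                raise  # out of attempts: surface the final error to the caller
            wait_time = backoff_factor ** (attempt - 1)
            logging.warning("Attempt %d/%d failed (%s); retrying in %.1fs",
                            attempt, retry_attempts, exc, wait_time)
            time.sleep(wait_time)
```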
```
Algorithm: Extraction Progress Checkpointing
INPUT: extracted_data, checkpoint_id
PROCESS:
1. CREATE checkpoint_metadata:
- checkpoint_id = provided identifier
- timestamp = current datetime
- record_count = count of extracted_data
- data = extracted_data
2. SERIALIZE checkpoint_metadata to storage format
3. SAVE to file: "checkpoint_" + checkpoint_id + ".json"
4. CONFIRM successful save operation
OUTPUT: checkpoint_saved_confirmation
```
```
Algorithm: Resume from Checkpoint
INPUT: checkpoint_id
PROCESS:
1. CONSTRUCT checkpoint_filename = "checkpoint_" + checkpoint_id + ".json"
2. TRY to read checkpoint_filename
3. IF file exists THEN
a. DESERIALIZE checkpoint_data from file
b. RETURN checkpoint_data['data']
4. ELSE
RETURN null (no checkpoint found)
OUTPUT: recovered_data OR null
```
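Both checkpoint operations can be sketched with plain JSON files; this assumes the record payloads are JSON-serializable, and very large datasets would need a sturdier store:

```python
import json
import os
from datetime import datetime, timezone

def save_checkpoint(extracted_data: list, checkpoint_id: str) -> str:
    """Persist extraction progress so a failed run can be resumed later."""
    path = f"checkpoint_{checkpoint_id}.json"
    payload = {
        "checkpoint_id": checkpoint_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_count": len(extracted_data),
        "data": extracted_data,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path

def resume_from_checkpoint(checkpoint_id: str):
    """Return previously checkpointed records, or None if no checkpoint exists."""
    path = f"checkpoint_{checkpoint_id}.json"
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["data"]
```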
## Performance Monitoring and Optimization
### Extraction Performance Metrics
```
Algorithm: Extraction Performance Tracking
INPUT: object_name, start_time, end_time, record_count, data_size_mb
PROCESS:
1. CALCULATE duration_seconds = end_time - start_time
2. IF duration_seconds > 0 THEN
a. CALCULATE records_per_second = record_count / duration_seconds
b. CALCULATE mb_per_second = data_size_mb / duration_seconds
3. ELSE
SET records_per_second = 0, mb_per_second = 0
4. CREATE performance_metrics:
- object_name, duration_seconds, records_extracted
- data_size_mb, records_per_second, mb_per_second
- extraction_timestamp = start_time
5. STORE metrics for object_name
6. RETURN performance_metrics
OUTPUT: extraction_performance_data
```
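A small helper capturing the same metrics; start and end times are assumed to be `time.time()` floats supplied by the caller:

```python
def track_extraction_performance(object_name: str, start_time: float, end_time: float,
                                 record_count: int, data_size_mb: float) -> dict:
    """Compute throughput metrics for a single extraction run."""
    duration = end_time - start_time
    return {
        "object_name": object_name,
        "duration_seconds": round(duration, 2),
        "records_extracted": record_count,
        "data_size_mb": data_size_mb,
        "records_per_second": record_count / duration if duration > 0 else 0,
        "mb_per_second": data_size_mb / duration if duration > 0 else 0,
        "extraction_timestamp": start_time,
    }
```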
```
Algorithm: Performance Report Generation
INPUT: collected_metrics_for_all_objects
PROCESS:
1. CALCULATE summary_statistics:
- total_objects = count of metrics
- total_records = sum of records_extracted across all objects
- total_data_size_mb = sum of data_size_mb across all objects
- average_records_per_second = mean of records_per_second values
2. COMPILE object_details from individual metrics
3. GENERATE optimization_recommendations based on performance patterns
4. CREATE comprehensive_report with:
- summary_statistics
- object_details
- optimization_recommendations
5. RETURN performance_report
OUTPUT: comprehensive_performance_analysis
```
## Tools and Integration
### Supported Tools and Platforms
```yaml
ETL_Tools:
  Talend:
    - Salesforce connectors
    - Built-in data quality
    - Visual job design
  Informatica:
    - PowerCenter
    - Cloud Data Integration
    - Real-time processing
  MuleSoft:
    - Anypoint Platform
    - API-led connectivity
    - Real-time synchronization
  Custom_Solutions:
    - Python with pandas
    - Apache Airflow
    - AWS Glue
    - Azure Data Factory
Monitoring_Tools:
  - Salesforce Event Monitoring
  - Custom logging frameworks
  - APM solutions (New Relic, Datadog)
  - Database performance monitors
```
## Success Criteria
- ✅ Source systems analyzed and profiled
- ✅ Extraction strategy designed and documented
- ✅ Performance optimization implemented
- ✅ Data validation framework established
- ✅ Error handling and recovery mechanisms active
- ✅ Monitoring and alerting configured
- ✅ Documentation and runbooks completed
- ✅ Stakeholder sign-off obtained