claude-flow-novice

Version:

Claude Flow Novice - Advanced orchestration platform for multi-agent AI workflows with CFN Loop architecture Includes Local RuVector Accelerator and all CFN skills for complete functionality.

github.com/cfn-dev/claude-flow-novice

cfn-dev/claude-flow-novice

644 lines (509 loc) • 15.4 kB

Markdown

# Performance Optimization - Architecture Reference Performance patterns, bottlenecks, and optimization strategies for CFN agent systems. ## Table of Contents 1. [Performance Metrics](#performance-metrics) 2. [Critical Bottlenecks](#critical-bottlenecks) 3. [Optimization Strategies](#optimization-strategies) 4. [Scalability Assessment](#scalability-assessment) 5. [Database Performance](#database-performance) 6. [Redis Performance](#redis-performance) 7. [Docker Performance](#docker-performance) 8. [Agent Performance](#agent-performance) 9. [Monitoring & Benchmarking](#monitoring--benchmarking) --- ## Performance Metrics ### Current Baseline (Skills Database) | Operation | Current | Target | Status | |-----------|---------|--------|--------| | Skill deployment | 268ms | <500ms | ✅ PASS | | Skill loading (cached) | 25ms | <30ms | ✅ PASS | | Skill loading (cold) | 27ms | <30ms | ✅ PASS | | Analytics queries | 22-56ms | <100ms | ✅ PASS | | **Skill update** | **535ms** | **<500ms** | ❌ FAIL | | Dual logging | 24ms | <50ms | ✅ PASS | | Concurrent operations (20 queries) | 109ms | <200ms | ✅ PASS | ### System-Level Metrics | Metric | Value | Threshold | |--------|-------|-----------| | Agent spawn time | 2-5s | <10s | | Redis response time | <1ms | <5ms | | SQLite query time | 10-50ms | <100ms | | Docker container start | 1-2s | <5s | | Memory usage per agent | 512MB-1GB | <2GB | | CPU usage per agent | 0.5-1.0 cores | <2 cores | ### Performance Score Calculation ``` Overall Score: 7.8/10 Expected After Optimization: 8.6/10 (+1.0 point, -30% latency) ``` --- ## Critical Bottlenecks ### 1. Skill Update Performance (CRITICAL) **Impact:** High | **Frequency:** Low | **Fix Difficulty:** Medium **Location:** `.claude/skills/workflow-codification/deploy-approved-skill.sh` **Root Cause:** Multiple separate `sqlite3` invocations: ```bash # Current: 5+ separate calls = 350ms overhead sqlite3 "$DB" "INSERT INTO skills..." # 50-80ms sqlite3 "$DB" "INSERT INTO approval_history..." # 50-80ms sqlite3 "$DB" "UPDATE skills SET..." # 50-80ms sqlite3 "$DB" "INSERT INTO agent_skill_mappings..." # 50-80ms per agent ``` **Solution:** Batch into single transaction: ```bash # Optimized: 1-2 calls = 60ms overhead sqlite3 "$DB" <<'EOF' BEGIN TRANSACTION; INSERT INTO skills ...; INSERT INTO approval_history ...; UPDATE skills SET ...; INSERT INTO agent_skill_mappings ...; COMMIT; EOF ``` **Expected Improvement:** 535ms → 150ms (-72% latency) ### 2. LRU Cache Inefficiency (MEDIUM) **Impact:** Medium | **Frequency:** High | **Fix Difficulty:** Low **Root Cause:** Cache misses due to small cache size (50 items). **Solution:** Implement adaptive cache sizing: ```bash # Increase cache size based on system memory SYSTEM_MEMORY=$(free -g | awk '/^Mem:/{print $2}') CACHE_SIZE=$((SYSTEM_MEMORY * 100)) # 100 items per GB ``` **Expected Improvement:** 40% hit rate → 75% hit rate ### 3. Missing Database Indexes (MEDIUM) **Impact:** Medium | **Frequency:** High | **Fix Difficulty:** Low **Solution:** ```sql CREATE INDEX idx_skills_name_hash ON skills(name, content_hash); CREATE INDEX idx_approval_history_skill_created ON approval_history(skill_id, created_at); CREATE INDEX idx_agent_mappings_agent_skill ON agent_skill_mappings(agent_id, skill_id); ``` **Expected Improvement:** Query time -40% --- ## Optimization Strategies ### Batch Operations Pattern **Before:** Individual operations ```bash for skill in $SKILLS; do sqlite3 "$DB" "INSERT INTO skills VALUES (...)" # 50ms each done ``` **After:** Batched transactions ```bash sqlite3 "$DB" <<'EOF' BEGIN TRANSACTION; $(printf "INSERT INTO skills VALUES (...);\n" $SKILLS) COMMIT; EOF ``` **Performance Gain:** 10x improvement for bulk operations ### Connection Pooling **Implementation:** ```javascript class ConnectionPool { constructor(maxSize = 10) { this.pool = []; this.maxSize = maxSize; } async acquire() { if (this.pool.length > 0) { return this.pool.pop(); } return new Promise((resolve, reject) => { const conn = new Redis(config.redis); conn.on('connect', () => resolve(conn)); conn.on('error', reject); }); } release(conn) { if (this.pool.length < this.maxSize) { this.pool.push(conn); } else { conn.quit(); } } } ``` **Performance Gain:** 60-80% reduction in connection overhead ### Query Result Caching **Redis Cache Layer:** ```bash # Cache query results with TTL cache_query() { local query="$1" local ttl="${2:-3600}" local key="cache:query:$(echo "$query" | sha256sum | cut -c1-16)" # Check cache first local cached=$(redis-cli GET "$key" 2>/dev/null) if [ -n "$cached" ]; then echo "$cached" return 0 fi # Execute query and cache result local result=$(sqlite3 "$DB" "$query") redis-cli SETEX "$key" "$ttl" "$result" >/dev/null echo "$result" } ``` **Performance Gain:** 90% reduction for repeated queries --- ## Scalability Assessment ### Current State (11 skills, 50 logs) - Database size: 228KB - Query response: 22-56ms - Cache hit rate: ~40% - Capacity utilization: <2% ### Projected Growth | Scale | Database Size | Query Time | Concerns | Recommendations | |-------|---------------|------------|----------|-----------------| | 100 skills | ~2MB | 25-70ms | Low | Monitor cache hit rate | | 1,000 skills | ~20MB | 30-90ms | Medium | Increase cache to 500 items | | 10,000+ skills | ~200MB | 100-500ms | High | Migrate to PostgreSQL | ### Scaling Strategies **Phase 1: SQLite Optimizations (0-1,000 skills)** - Implement aggressive caching - Batch operations - Optimize indexes **Phase 2: Hybrid Approach (1,000-10,000 skills)** - SQLite for hot data - PostgreSQL for analytics - Redis for coordination **Phase 3: Full Migration (10,000+ skills)** - PostgreSQL primary - Read replicas for analytics - Sharding for massive scale --- ## Database Performance ### SQLite Optimization **PRAGMA Settings:** ```sql PRAGMA journal_mode = WAL; -- Better concurrency PRAGMA synchronous = NORMAL; -- Balance safety/performance PRAGMA cache_size = -64000; -- 64MB cache PRAGMA temp_store = MEMORY; -- Temp tables in memory PRAGMA mmap_size = 268435456; -- 256MB memory mapping ``` **Batch Insert Pattern:** ```bash # Prepare batch data batch_data="" for record in "$@"; do batch_data+="INSERT INTO skills VALUES ($record);" done # Execute as single transaction { echo "PRAGMA journal_mode = WAL;" echo "BEGIN TRANSACTION;" echo "$batch_data" echo "COMMIT;" } | sqlite3 "$DB" ``` **Vacuum and Analyze:** ```bash # Schedule regular maintenance sqlite3 "$DB" "VACUUM;" sqlite3 "$DB" "ANALYZE;" ``` ### PostgreSQL Migration Benefits **Advantages for High Scale:** - True concurrent writes - Better query planner - Partitioning support - Native replication **Migration Path:** ```bash # Export SQLite data sqlite3 "$DB" ".dump --data-only" > data.sql # Import to PostgreSQL psql "$PG_URL" < data.sql ``` --- ## Redis Performance ### Memory Optimization **Key Expiry Strategy:** ```bash # Set appropriate TTLs redis-cli SETEX "session:$TASK_ID" 3600 "$DATA" # 1 hour redis-cli SETEX "coord:$TASK_ID" 86400 "$STATE" # 24 hours redis-cli SETEX "cache:query:$HASH" 300 "$RESULT" # 5 minutes ``` **Memory Monitoring:** ```bash # Check memory usage redis-cli INFO memory | grep used_memory_human # Monitor key expiration redis-cli CONFIG SET notify-keyspace-events Ex ``` ### Pipeline Operations **Batch Commands:** ```javascript // Redis pipeline for multiple operations const pipeline = redis.pipeline(); pipeline.hset('context:task1', 'field1', 'value1'); pipeline.hset('context:task1', 'field2', 'value2'); pipeline.expire('context:task1', 3600); await pipeline.exec(); ``` **Performance Gain:** 5-10x improvement for bulk operations ### Connection Management **Connection Pool Configuration:** ```javascript const redis = new Redis({ port: 6379, host: 'redis', maxRetriesPerRequest: 3, retryDelayOnFailover: 100, lazyConnect: true, keepAlive: 30000, maxMemoryPolicy: 'allkeys-lru' }); ``` --- ## Docker Performance ### Build Optimization **Linux Native Build:** ```bash # REQUIRED: Build from Linux filesystem export DOCKER_BUILDKIT=1 docker build \ --file docker/Dockerfile.agent \ --target production \ --build-arg BUILDKIT_INLINE_CACHE=1 \ --cache-from cfn-agent:cache \ --tag cfn-agent:latest \ . # Build times: # Windows mount: ~755s # Linux native: <20s ``` ### Container Resource Tuning **Memory Limits:** ```yaml services: agent: deploy: resources: limits: memory: 1G cpus: '1.0' reservations: memory: 512M cpus: '0.5' environment: - NODE_OPTIONS=--max-old-space-size=768 ``` **CPU Pinning (for high-performance):** ```yaml services: agent: deploy: resources: limits: cpus: '1.0' cpuset: '0,1' # Pin to specific cores ``` ### Volume Performance **Optimized Volume Configuration:** ```yaml services: agent: volumes: - type: tmpfs target: /tmp tmpfs: size: 1G mode: 1777 - type: bind source: ${PWD}/.claude target: /root/.claude bind: propagation: shared ``` --- ## Agent Performance ### Spawn Time Optimization **Pre-warming Strategy:** ```bash # Maintain pool of warm containers docker run -d --name agent-pool-1 cfn-agent:latest sleep infinity docker run -d --name agent-pool-2 cfn-agent:latest sleep infinity # Quick spawn from pool docker exec agent-pool-1 /root/.claude/agents/entrypoint.sh ``` **Parallel Spawning:** ```bash # Spawn agents in parallel for agent_id in $AGENT_IDS; do ( export AGENT_ID="$agent_id" ./spawn-agent.sh "$agent_id" "$TASK_ID" ) & done wait ``` ### Memory Management **Garbage Collection Tuning:** ```bash export NODE_OPTIONS="--max-old-space-size=768 --max-new-space-size=128" ``` **Memory Cleanup:** ```bash # Cleanup after agent completion cleanup_agent() { local agent_id="$1" docker stop "agent-${agent_id}" || true docker rm "agent-${agent_id}" || true # Clear Redis keys redis-cli DEL "coord:${TASK_ID}:${agent_id}:*" 2>/dev/null || true } ``` ### Execution Optimization **Timeout Management:** ```bash # Adaptive timeout based on task complexity calculate_timeout() { local complexity="$1" local base_timeout=300 case "$complexity" in "low") echo $((base_timeout / 2)) ;; "medium") echo $base_timeout ;; "high") echo $((base_timeout * 2)) ;; esac } ``` --- ## Monitoring & Benchmarking ### Performance Metrics Collection **Prometheus Metrics:** ```javascript // Custom metrics registry const client = require('prom-client'); const agentSpawnDuration = new client.Histogram({ name: 'agent_spawn_duration_seconds', help: 'Time to spawn agent container', labelNames: ['agent_type', 'mode'] }); const queryDuration = new client.Histogram({ name: 'database_query_duration_seconds', help: 'Database query execution time', labelNames: ['query_type', 'table'] }); ``` ### Benchmarking Tools **Database Benchmark:** ```bash #!/bin/bash # Benchmark script for SQLite operations benchmark_sql() { local iterations="$1" local query="$2" echo "Benchmarking: $query" for i in $(seq 1 $iterations); do start=$(date +%s%N) sqlite3 "$DB" "$query" >/dev/null end=$(date +%s%N) duration=$(((end - start) / 1000000)) echo "$duration" done } # Run benchmarks benchmark_sql 1000 "SELECT * FROM skills WHERE name = 'test'" benchmark_sql 100 "INSERT INTO skills VALUES (...)" # Calculate statistics awk '{sum+=$1; sumsq+=$1*$1} END {print "Avg:", sum/NR, "Std:", sqrt(sumsq/NR - (sum/NR)^2)}' ``` ### Performance Alerts **Alert Rules:** ```yaml # alerts.yml groups: - name: performance rules: - alert: SlowQuery expr: histogram_quantile(0.95, database_query_duration_seconds) > 0.1 for: 5m labels: severity: warning annotations: summary: "Slow database queries detected" - alert: HighMemoryUsage expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9 for: 10m labels: severity: critical annotations: summary: "Container approaching memory limit" ``` ### Continuous Performance Testing **Load Test Script:** ```bash #!/bin/bash # Load test CFN Loop with concurrent agents concurrent_agents=${1:-10} test_duration=${2:-300} echo "Starting load test: $concurrent_agents agents for ${test_duration}s" # Spawn test agents for i in $(seq 1 $concurrent_agents); do AGENT_ID="test-agent-$i" \ TASK_ID="load-test-$(date +%s)" \ ./spawn-agent.sh & done # Monitor metrics during test start_time=$(date +%s) while [ $(($(date +%s) - start_time)) -lt $test_duration ]; do # Collect metrics memory_usage=$(docker stats --no-stream --format "{{.MemUsage}}" | head -1) redis_ops=$(redis-cli INFO stats | grep instantaneous_ops_per_sec) echo "$(date): Memory: $memory_usage, Redis ops: $redis_ops" sleep 10 done ``` --- ## Performance Quick Reference ### Optimization Priority Matrix | Area | Impact | Effort | Priority | Est. Time | |------|--------|--------|----------|-----------| | Skill Update | High | 2h | P0 | 1-2 days | | LRU Cache | Medium | 3h | P1 | 3-5 days | | Indexes | Medium | 1h | P1 | 1 day | | Connection Pooling | Medium | 4h | P1 | 2-3 days | | Query Caching | High | 6h | P1 | 3-4 days | | PostgreSQL Migration | Strategic | 16h | P3 | 1-2 weeks | ### Performance Check Commands ```bash # Database performance sqlite3 "$DB" "EXPLAIN QUERY PLAN SELECT * FROM skills WHERE name = 'test'" sqlite3 "$DB" "PRAGMA table_info(skills)" # Redis performance redis-cli LATENCY LATEST redis-cli INFO stats # Container performance docker stats --no-stream docker exec $CONTAINER cat /proc/meminfo # Network performance iperf3 -c redis -t 30 ``` ### Performance Tuning Checklist **Database:** - [ ] Add appropriate indexes - [ ] Use WAL journal mode - [ ] Implement query result caching - [ ] Batch operations in transactions - [ ] Regular VACUUM and ANALYZE **Redis:** - [ ] Set appropriate key TTLs - [ ] Use pipelines for bulk operations - [ ] Monitor memory usage - [ ] Implement connection pooling **Docker:** - [ ] Build from Linux filesystem - [ ] Set appropriate resource limits - [ ] Use tmpfs for temporary data - [ ] Pin CPUs for performance-critical tasks **Agents:** - [ ] Implement timeouts - [ ] Cleanup resources after completion - [ ] Use parallel spawning where appropriate - [ ] Monitor memory usage