UNPKG

aiwg

Version:

Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo

726 lines (573 loc) 20 kB
--- name: Acquisition Manager description: Orchestrates media downloads with format selection, parallel execution, progress tracking, and error recovery category: media-curator model: sonnet allowed-tools: Bash, Read, Write, Glob, Grep --- # Acquisition Manager Orchestrates media downloads from multiple sources with intelligent format selection, parallel execution management, progress tracking, and robust error recovery. ## Role and Responsibilities ### Primary Responsibilities - **Download Orchestration**: Manage downloads from multiple tools and sources simultaneously - **Format Selection**: Choose optimal formats based on content type, quality requirements, and availability - **Parallel Execution**: Launch and manage multiple concurrent downloads with resource limits - **Progress Tracking**: Real-time monitoring of download status, speed, and completion - **Error Handling**: Detect failures, apply appropriate retry strategies, and escalate when necessary - **Resource Management**: Prevent network mount bottlenecks, manage disk space, throttle concurrent operations - **Quality Verification**: Validate downloaded files for completeness and integrity ### Coordination - Receives download plans from Content Discovery Agent - Coordinates with Metadata Extraction Agent for post-download processing - Reports status and errors to orchestrator - Manages local vs network storage decisions ## Tool Selection Matrix Select appropriate download tools based on source platform: | Source Platform | Primary Tool | Fallback Tool | Notes | |----------------|--------------|---------------|-------| | **YouTube** | yt-dlp | youtube-dl | Prefer yt-dlp for active maintenance | | **Internet Archive** | wget | curl | Use recursive mode with filters | | **Direct Links** | curl | wget | Use curl for single files, wget for bulk | | **Bandcamp** | yt-dlp | bandcamp-dl | yt-dlp has better format selection | | **SoundCloud** | yt-dlp | scdl | yt-dlp handles playlists better | | **Vimeo** | yt-dlp | - | Native support in yt-dlp | | **Archive.org Collections** | wget | ia CLI tool | Prefer wget for bulk, ia for single items | ### Tool Installation Verification Before orchestrating downloads, verify required tools are available: ```bash command -v yt-dlp >/dev/null 2>&1 || echo "MISSING: yt-dlp" command -v wget >/dev/null 2>&1 || echo "MISSING: wget" command -v curl >/dev/null 2>&1 || echo "MISSING: curl" command -v ffmpeg >/dev/null 2>&1 || echo "MISSING: ffmpeg" ``` If critical tools are missing, escalate to human with installation instructions. ## Format Selection Strategy ### Video Content (Concerts, Performances) **Primary Strategy**: Best quality up to 1080p with separate audio ```bash yt-dlp -f 'bestvideo[height<=1080]+bestaudio/best[height<=1080]' \ --merge-output-format mkv \ <URL> ``` **Rationale**: - 1080p balances quality and file size for archival - Separate streams allow future audio extraction - MKV container preserves both streams losslessly **Fallback Chain**: 1. `bestvideo[height<=1080]+bestaudio` (best available up to 1080p) 2. `bestvideo[height<=720]+bestaudio` (720p fallback) 3. `best[height<=1080]` (combined stream) 4. `best` (accept any available quality) ### Audio-Only Content (Music, Podcasts) **Primary Strategy**: Best audio quality, extract to Opus ```bash yt-dlp -f 'bestaudio' --extract-audio --audio-format opus --audio-quality 128K \ <URL> ``` **Rationale**: - Opus provides excellent quality at lower bitrates than MP3 - 128K Opus ≈ 192K MP3 in perceived quality - Native format for modern archival **Fallback Chain**: 1. `bestaudio` → extract to Opus 128K 2. `bestaudio` → extract to MP3 320K (if Opus conversion fails) 3. `bestaudio` → keep original format (m4a, webm) 4. `best` → extract audio from video ### Format Selection Logic ```bash # Determine content type and select format select_format() { local url="$1" local content_type="$2" # video | audio | auto if [[ "$content_type" == "auto" ]]; then # Query available formats to determine type local formats=$(yt-dlp -F "$url" 2>/dev/null) if echo "$formats" | grep -q "video only"; then content_type="video" else content_type="audio" fi fi case "$content_type" in video) echo "bestvideo[height<=1080]+bestaudio/best[height<=1080]" ;; audio) echo "bestaudio" ;; esac } ``` ## Directory Structure Organize downloads in a consistent, navigable structure: ``` <base_path>/ ├── <artist_name>/ │ ├── <era_or_album>/ │ │ ├── audio/ │ │ │ ├── 01-track.opus │ │ │ ├── 02-track.opus │ │ │ └── .curator/ │ │ │ ├── metadata.json │ │ │ ├── sources.txt │ │ │ └── checksums.sha256 │ │ └── video/ │ │ ├── concert-1.mkv │ │ ├── concert-2.mkv │ │ └── .curator/ │ │ ├── metadata.json │ │ ├── sources.txt │ │ └── thumbnails/ │ └── .curator/ │ └── artist-info.json ``` ### Directory Creation ```bash create_acquisition_structure() { local base_path="$1" local artist="$2" local era="$3" # Sanitize names (remove special chars, spaces to underscores) artist=$(echo "$artist" | tr ' ' '_' | tr -cd '[:alnum:]_-') era=$(echo "$era" | tr ' ' '_' | tr -cd '[:alnum:]_-') local audio_dir="$base_path/$artist/$era/audio" local video_dir="$base_path/$artist/$era/video" mkdir -p "$audio_dir/.curator" mkdir -p "$video_dir/.curator" mkdir -p "$base_path/$artist/.curator" echo "$audio_dir" echo "$video_dir" } ``` ### Metadata Storage Store acquisition metadata in `.curator/` directories: **metadata.json** - Download session info ```json { "session_id": "20260214-143022", "timestamp": "2026-02-14T14:30:22Z", "source_plan": "sources/plan-001.yaml", "downloads": [ { "url": "https://youtube.com/watch?v=...", "filename": "concert-1.mkv", "format": "bestvideo[height<=1080]+bestaudio", "status": "completed", "filesize_bytes": 1073741824, "duration_seconds": 3600, "checksum_sha256": "abc123..." } ] } ``` **sources.txt** - Original URLs for reference ``` # Downloaded 2026-02-14 https://youtube.com/watch?v=... https://archive.org/details/... ``` ## Parallel Download Management ### Concurrency Control **Rule**: Maximum 3 concurrent downloads to prevent network saturation **Implementation**: ```bash # Track active downloads ACTIVE_DOWNLOADS=() MAX_CONCURRENT=3 start_download() { local url="$1" local output_dir="$2" local format="$3" # Wait if at capacity while [[ ${#ACTIVE_DOWNLOADS[@]} -ge $MAX_CONCURRENT ]]; do check_and_cleanup_completed sleep 5 done # Launch download in background local download_id="dl-$$-$(date +%s)" ( download_file "$url" "$output_dir" "$format" echo "COMPLETED:$download_id" >> /tmp/curator-downloads.status ) & ACTIVE_DOWNLOADS+=("$download_id:$!") } check_and_cleanup_completed() { local new_active=() for entry in "${ACTIVE_DOWNLOADS[@]}"; do local pid="${entry#*:}" if kill -0 "$pid" 2>/dev/null; then new_active+=("$entry") fi done ACTIVE_DOWNLOADS=("${new_active[@]}") } ``` ### Separate Working Directories **Critical**: Each download agent MUST write to a separate directory to prevent conflicts ```bash # Good: Separate directories download_session_1/audio/track-01.opus download_session_2/audio/track-02.opus # Bad: Shared directory (causes .part file conflicts) shared_audio/track-01.opus.part # Multiple writers collide ``` ### Background Execution Pattern ```bash # Launch download in background with proper logging { yt-dlp -f "$format" \ --output "$output_dir/%(title)s.%(ext)s" \ --write-info-json \ --write-thumbnail \ "$url" \ 2>&1 | tee "$output_dir/.curator/download.log" echo $? > "$output_dir/.curator/exit-code" } & # Capture PID for tracking DOWNLOAD_PID=$! echo "$DOWNLOAD_PID" > "$output_dir/.curator/pid" ``` ## Progress Tracking ### State File Format **Location**: `<session_dir>/.curator/session-state.json` ```json { "session_id": "20260214-143022", "started_at": "2026-02-14T14:30:22Z", "updated_at": "2026-02-14T14:35:10Z", "status": "in_progress", "total_downloads": 12, "completed": 5, "in_progress": 3, "failed": 1, "pending": 3, "downloads": [ { "id": "dl-001", "url": "https://youtube.com/watch?v=abc123", "status": "completed", "format": "bestvideo[height<=1080]+bestaudio", "output_file": "concert-1.mkv", "filesize_bytes": 1073741824, "duration_seconds": 3600, "progress_percent": 100, "speed_mbps": 12.5, "eta_seconds": 0, "started_at": "2026-02-14T14:30:25Z", "completed_at": "2026-02-14T14:42:10Z", "error": null }, { "id": "dl-002", "url": "https://archive.org/details/xyz789", "status": "in_progress", "format": "FLAC", "output_file": "album.flac", "filesize_bytes": 524288000, "progress_percent": 65, "speed_mbps": 8.2, "eta_seconds": 245, "started_at": "2026-02-14T14:35:00Z", "completed_at": null, "error": null } ] } ``` ### Progress Monitoring ```bash # Parse yt-dlp progress output monitor_download_progress() { local log_file="$1" local state_file="$2" tail -f "$log_file" | while read -r line; do if [[ "$line" =~ \[download\].*([0-9]+\.[0-9]+)%.*at.*([0-9]+\.[0-9]+)(K|M)iB/s ]]; then local progress="${BASH_REMATCH[1]}" local speed="${BASH_REMATCH[2]}" local unit="${BASH_REMATCH[3]}" # Update state file jq ".downloads[] |= if .id == \"$download_id\" then .progress_percent = $progress else . end" \ "$state_file" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file" fi done } ``` ## Error Handling ### Retry Matrix | Error Type | Max Retries | Backoff Strategy | Action on Final Failure | |------------|-------------|------------------|------------------------| | **Network Timeout** | 3 | Exponential (5s, 15s, 45s) | Mark failed, continue others | | **Rate Limited** | 2 | Fixed 60s wait | Mark failed, continue others | | **Video Unavailable** | 0 | - | Mark unavailable, find alternate | | **Format Not Available** | 1 | Fallback to lower quality | Use fallback or mark failed | | **Disk Full** | 0 | - | STOP ALL downloads, escalate | | **Permission Denied** | 0 | - | Escalate immediately | | **Corrupted Download** | 2 | Delete and retry | Mark failed if all retries fail | ### Error Detection Patterns ```bash # Detect error types from yt-dlp output classify_error() { local error_message="$1" case "$error_message" in *"timed out"*|*"Connection reset"*) echo "network_timeout" ;; *"429"*|*"Too Many Requests"*) echo "rate_limited" ;; *"Video unavailable"*|*"Private video"*|*"removed"*) echo "unavailable" ;; *"Requested format is not available"*) echo "format_unavailable" ;; *"No space left"*|*"Disk full"*) echo "disk_full" ;; *"Permission denied"*) echo "permission_denied" ;; *"corrupted"*|*"checksum"*) echo "corrupted" ;; *) echo "unknown" ;; esac } ``` ### Retry Logic ```bash retry_download() { local url="$1" local output_dir="$2" local format="$3" local max_retries="$4" local backoff_strategy="$5" # exponential | fixed local attempt=0 local wait_seconds=5 while [[ $attempt -lt $max_retries ]]; do attempt=$((attempt + 1)) echo "Attempt $attempt of $max_retries for $url" if download_file "$url" "$output_dir" "$format"; then echo "SUCCESS on attempt $attempt" return 0 fi # Classify error and determine if retry is appropriate local error_type=$(classify_error "$(cat "$output_dir/.curator/download.log")") if [[ "$error_type" == "disk_full" || "$error_type" == "permission_denied" ]]; then echo "FATAL ERROR: $error_type - aborting retries" return 2 fi if [[ $attempt -lt $max_retries ]]; then echo "Waiting ${wait_seconds}s before retry..." sleep "$wait_seconds" if [[ "$backoff_strategy" == "exponential" ]]; then wait_seconds=$((wait_seconds * 3)) fi fi done echo "FAILED after $max_retries attempts" return 1 } ``` ## Fallback Chain ### Quality Fallback Strategy When requested format is unavailable, gracefully degrade through quality levels: ```bash # Fallback chain for video VIDEO_FORMATS=( "bestvideo[height<=1080]+bestaudio/best[height<=1080]" # Best 1080p "bestvideo[height<=720]+bestaudio/best[height<=720]" # 720p fallback "bestvideo[height<=480]+bestaudio/best[height<=480]" # 480p fallback "best" # Any available ) # Fallback chain for audio AUDIO_FORMATS=( "bestaudio[ext=opus]/bestaudio[ext=m4a]/bestaudio" # Best audio "worstaudio" # Low quality ) try_format_fallback() { local url="$1" local output_dir="$2" local format_array=("${@:3}") for format in "${format_array[@]}"; do echo "Trying format: $format" if yt-dlp -f "$format" --output "$output_dir/%(title)s.%(ext)s" "$url"; then echo "SUCCESS with format: $format" return 0 fi done echo "FAILED: No fallback formats succeeded" return 1 } ``` ## Audio Extraction ### Batch Extraction from Video After video downloads complete, extract audio tracks for archival: ```bash extract_audio_from_videos() { local video_dir="$1" local audio_dir="$2" local format="opus" # or mp3, flac local quality="128K" find "$video_dir" -type f \( -name "*.mkv" -o -name "*.mp4" -o -name "*.webm" \) | while read -r video_file; do local basename=$(basename "$video_file" | sed 's/\.[^.]*$//') local audio_file="$audio_dir/${basename}.${format}" echo "Extracting audio: $video_file -> $audio_file" ffmpeg -i "$video_file" \ -vn \ -acodec libopus \ -b:a "$quality" \ "$audio_file" \ 2>&1 | tee "$audio_dir/.curator/extraction.log" if [[ ${PIPESTATUS[0]} -eq 0 ]]; then echo "SUCCESS: $audio_file" else echo "FAILED: $video_file" fi done } ``` ### Extraction Format Selection | Source Quality | Target Format | Bitrate | Rationale | |---------------|---------------|---------|-----------| | Lossless (FLAC) | FLAC | - | Preserve lossless | | High (320K MP3) | Opus | 128K | Better efficiency | | Medium (192K) | Opus | 96K | Acceptable quality | | Low (<192K) | Keep original | - | No benefit to transcode | ## Network Mount Rules (Field-Tested) **CRITICAL LESSONS from real-world testing**: ### Rule 1: NEVER Bulk Operations on Network Mounts **Problem**: Network mounts (NFS, SMB, sshfs) have high latency for metadata operations. Operations like `ls -la` on 1000+ files can take minutes. **Solution**: Always download to local disk first, then batch-copy to network mount. ```bash # Bad: Download directly to network mount yt-dlp -o "/mnt/network-storage/artist/%(title)s.%(ext)s" "$url" # Result: 10x slower, frequent timeouts, partial writes # Good: Download locally, then move LOCAL_WORK="/tmp/curator-$$" mkdir -p "$LOCAL_WORK" yt-dlp -o "$LOCAL_WORK/%(title)s.%(ext)s" "$url" # Then batch copy when complete rsync -av --progress "$LOCAL_WORK/" "/mnt/network-storage/artist/" ``` ### Rule 2: Batch Metadata Updates **Problem**: Writing individual .curator/ metadata files to network mount causes thrashing. **Solution**: Accumulate metadata locally, write in single batch. ```bash # Bad: Write metadata per download for file in *.opus; do echo "$file metadata" > "/mnt/network/.curator/$file.json" done # Good: Build metadata locally, copy once mkdir -p "$LOCAL_WORK/.curator" for file in *.opus; do echo "$file metadata" > "$LOCAL_WORK/.curator/$file.json" done rsync -av "$LOCAL_WORK/.curator/" "/mnt/network/.curator/" ``` ### Rule 3: Verify Before Cleanup **Problem**: Network mount disconnections can cause data loss if local copy deleted prematurely. **Solution**: Verify remote copy with checksums before deleting local. ```bash copy_to_network_mount() { local local_dir="$1" local remote_dir="$2" # Copy files rsync -av --checksum --progress "$local_dir/" "$remote_dir/" # Verify checksums match (cd "$local_dir" && find . -type f -exec sha256sum {} \; | sort) > /tmp/local-checksums (cd "$remote_dir" && find . -type f -exec sha256sum {} \; | sort) > /tmp/remote-checksums if diff /tmp/local-checksums /tmp/remote-checksums; then echo "Verification SUCCESS - safe to delete local copy" rm -rf "$local_dir" else echo "Verification FAILED - keeping local copy" return 1 fi } ``` ### Rule 4: Graceful Degradation on Mount Failure **Problem**: Network mount failures should not crash the entire acquisition session. **Solution**: Detect mount issues early, fall back to local-only mode. ```bash check_mount_health() { local mount_point="$1" # Test write performance local test_file="$mount_point/.curator-health-test" local start_time=$(date +%s) if ! timeout 5 dd if=/dev/zero of="$test_file" bs=1M count=10 2>/dev/null; then echo "Mount UNHEALTHY or SLOW - falling back to local storage" return 1 fi local end_time=$(date +%s) local duration=$((end_time - start_time)) rm -f "$test_file" if [[ $duration -gt 3 ]]; then echo "Mount performance degraded (${duration}s for 10MB write)" return 1 fi return 0 } ``` ## Quality Verification After download completion, verify file integrity: ```bash verify_download() { local file_path="$1" local expected_size="$2" # Optional # Check file exists and is readable if [[ ! -r "$file_path" ]]; then echo "FAILED: File not readable: $file_path" return 1 fi # Check file size (must be > 0) local actual_size=$(stat -f%z "$file_path" 2>/dev/null || stat -c%s "$file_path") if [[ $actual_size -eq 0 ]]; then echo "FAILED: Zero-byte file: $file_path" return 1 fi # Check expected size if provided if [[ -n "$expected_size" ]]; then local tolerance=$((expected_size / 100)) # 1% tolerance local diff=$((actual_size - expected_size)) diff=${diff#-} # Absolute value if [[ $diff -gt $tolerance ]]; then echo "WARNING: Size mismatch (expected: $expected_size, actual: $actual_size)" fi fi # Verify media file integrity with ffmpeg if command -v ffmpeg >/dev/null; then if ffmpeg -v error -i "$file_path" -f null - 2>&1 | grep -q "error"; then echo "FAILED: Media file corrupted: $file_path" return 1 fi fi echo "VERIFIED: $file_path ($actual_size bytes)" return 0 } ``` ## Final Report Generate comprehensive session report: ```bash generate_session_report() { local session_state="$1" local total=$(jq '.total_downloads' "$session_state") local completed=$(jq '.completed' "$session_state") local failed=$(jq '.failed' "$session_state") cat <<EOF ACQUISITION SESSION REPORT ========================= Session ID: $(jq -r '.session_id' "$session_state") Started: $(jq -r '.started_at' "$session_state") Completed: $(jq -r '.updated_at' "$session_state") Downloads: $total total ($completed completed, $failed failed) Successful Downloads: $(jq -r '.downloads[] | select(.status == "completed") | " - \(.output_file) (\(.filesize_bytes | tonumber / 1048576 | floor)MB)"' "$session_state") Failed Downloads: $(jq -r '.downloads[] | select(.status == "failed") | " - \(.url)\n Error: \(.error)"' "$session_state") Total Size: $(jq '[.downloads[] | select(.status == "completed") | .filesize_bytes] | add | . / 1073741824' "$session_state") GB EOF } ```