aiwg

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

aiwg.io

jmagly/aiwg

---
description: Download media from discovered sources with format selection and progress tracking
category: media-curator
argument-hint: --plan <sources.yaml> | --url <URL> [--format audio|video|best] [--output <dir>] [--parallel N]
allowed-tools: Bash, Read, Write, Glob, Grep
model: sonnet
---

# /acquire

Download media from discovered sources with intelligent format selection, parallel execution, and comprehensive progress tracking.

## Purpose

The `/acquire` command orchestrates media downloads from various sources (YouTube, Internet Archive, Bandcamp, direct links) using the Acquisition Manager agent. It handles:

- Single URL or batch downloads from source plans
- Automatic format selection based on content type
- Parallel download execution with resource limits
- Real-time progress tracking and status reporting
- Error recovery with retry strategies
- Quality verification and metadata recording
- Network mount awareness for optimal performance

## Parameters

### Required (one of)

**`--plan <sources.yaml>`**
- Source plan file generated by `/find-sources` command
- YAML format containing URLs, metadata, and acquisition strategy
- Example: `.curator/sources/plan-001.yaml`

**`--url <URL>`**
- Single URL for one-off downloads
- Supports: YouTube, Internet Archive, Bandcamp, SoundCloud, Vimeo, direct links
- Example: `https://youtube.com/watch?v=abc123`

### Optional

**`--format <audio|video|best>`**
- Format preference for downloads
- `audio`: Extract audio only (Opus 128K default)
- `video`: Best video up to 1080p with audio
- `best`: Let agent decide based on content type (default)
- Example: `--format audio`

**`--output <directory>`**
- Base output directory for downloads
- Default: `./downloads/<timestamp>`
- Structure: `<output>/<artist>/<era>/audio/` and `/video/`
- Example: `--output /mnt/archive/music`

**`--parallel <N>`**
- Maximum concurrent downloads
- Range: 1-5 (default: 3)
- Lower for slow networks, higher for fast connections
- Example: `--parallel 5`

**`--local-work <directory>`**
- Local working directory (required for network mount outputs)
- Downloads happen here first, then are copied to the output directory
- Default: `/tmp/curator-work-$$`
- Example: `--local-work /fast-local-disk/tmp`

**`--verify-after`**
- Enable post-download verification
- Checks file integrity, size, media validity
- Adds ~5% overhead but catches corrupted files before they are archived
- Example: `--verify-after`

**`--extract-audio`**
- Automatically extract audio from downloaded videos
- Creates parallel audio files in the `audio/` directory
- Uses Opus 128K by default
- Example: `--extract-audio`

**`--resume <session-id>`**
- Resume an interrupted download session
- Loads state from `.curator/sessions/<session-id>/state.json`
- Skips completed downloads, retries failed ones
- Example: `--resume 20260214-143022`

## Usage Examples

### Single URL Download

```bash
# Download single YouTube video (audio only)
/acquire --url "https://youtube.com/watch?v=abc123" --format audio

# Download single video with metadata
/acquire --url "https://youtube.com/watch?v=abc123" \
  --format video \
  --output /mnt/archive/concerts \
  --verify-after
```

### Batch Download from Plan

```bash
# Download entire source plan (generated by /find-sources)
/acquire --plan .curator/sources/plan-001.yaml

# Download to network mount (uses local working dir)
/acquire --plan .curator/sources/plan-001.yaml \
  --output /mnt/network-storage/archive \
  --local-work /fast-ssd/curator-work \
  --parallel 3
```

### Audio Extraction Workflow

```bash
# Download videos and auto-extract audio
/acquire --plan .curator/sources/plan-001.yaml \
  --format video \
  --extract-audio \
  --verify-after

# Result structure:
# downloads/artist/era/video/concert.mkv
# downloads/artist/era/audio/concert.opus
```

### Resume Interrupted Session

```bash
# Resume after network failure
/acquire --resume 20260214-143022

# Resume with different output (move completed files)
/acquire --resume 20260214-143022 \
  --output /new/location
```

## Workflow

### 1. Initialization

```bash
# Validate parameters
# - Check --plan exists OR --url provided
# - Verify output directory is writable
# - Check required tools (yt-dlp, wget, curl, ffmpeg)
# - Test network mount performance if applicable
# - Create session directory structure

# Session setup
SESSION_ID=$(date +%Y%m%d-%H%M%S)
SESSION_DIR=".curator/sessions/$SESSION_ID"
mkdir -p "$SESSION_DIR"/{logs,metadata}

# Initialize state file (parallel falls back to the documented default of 3
# so the JSON stays valid when --parallel is not given)
cat > "$SESSION_DIR/state.json" <<EOF
{
  "session_id": "$SESSION_ID",
  "started_at": "$(date -Iseconds)",
  "status": "initializing",
  "parameters": {
    "plan": "$PLAN_FILE",
    "format": "$FORMAT",
    "output": "$OUTPUT_DIR",
    "parallel": ${PARALLEL:-3}
  },
  "downloads": []
}
EOF
```

### 2. Source Validation

```bash
# If --plan provided, parse YAML
if [[ -n "$PLAN_FILE" ]]; then
    # Extract URLs and metadata
    SOURCES=$(yq eval '.sources[] | .url' "$PLAN_FILE")
    TOTAL_SOURCES=$(echo "$SOURCES" | wc -l)

    # Validate each URL is accessible
    while IFS= read -r url; do
        if yt-dlp --dump-json "$url" >/dev/null 2>&1; then
            echo "VALID: $url"
        else
            echo "WARNING: Cannot access $url"
        fi
    done <<< "$SOURCES"
fi

# If --url provided, validate single URL
if [[ -n "$SINGLE_URL" ]]; then
    if ! yt-dlp --dump-json "$SINGLE_URL" >/dev/null 2>&1; then
        echo "ERROR: Cannot access URL: $SINGLE_URL"
        exit 1
    fi
fi
```

### 3. Directory Structure Creation

```bash
create_acquisition_structure() {
    local output_base="$1"
    local artist="$2"
    local era="$3"

    # Sanitize directory names
    local safe_artist=$(echo "$artist" | sed 's/[^a-zA-Z0-9_-]/_/g')
    local safe_era=$(echo "$era" | sed 's/[^a-zA-Z0-9_-]/_/g')

    # Create directory tree
    local audio_dir="$output_base/$safe_artist/$safe_era/audio"
    local video_dir="$output_base/$safe_artist/$safe_era/video"

    mkdir -p "$audio_dir/.curator"
    mkdir -p "$video_dir/.curator"
    mkdir -p "$output_base/$safe_artist/.curator"

    # Write artist metadata
    cat > "$output_base/$safe_artist/.curator/artist-info.json" <<EOF
{
  "name": "$artist",
  "era": "$era",
  "created_at": "$(date -Iseconds)",
  "session_id": "$SESSION_ID"
}
EOF

    echo "$audio_dir:$video_dir"
}
```

### 4. Launch Downloads

```bash
# Download orchestration with concurrency control
set -o pipefail   # make "yt-dlp ... | tee" report yt-dlp's exit status, not tee's
ACTIVE_DOWNLOADS=()
MAX_CONCURRENT=${PARALLEL:-3}

launch_download() {
    local url="$1"
    local output_dir="$2"
    local format="$3"
    # $RANDOM avoids ID collisions when two downloads start in the same second
    local download_id="dl-$SESSION_ID-$(date +%s)-$RANDOM"

    # Wait for slot availability
    while [[ ${#ACTIVE_DOWNLOADS[@]} -ge $MAX_CONCURRENT ]]; do
        check_completed_downloads
        sleep 2
    done

    # Determine target directory (audio vs video)
    local target_dir
    if [[ "$format" == "audio" || "$format" == "bestaudio" ]]; then
        target_dir="$output_dir/audio"
    else
        target_dir="$output_dir/video"
    fi

    # Launch download in background
    (
        download_with_retry "$url" "$target_dir" "$format" "$download_id"
        echo "COMPLETED:$download_id:$?" >> "$SESSION_DIR/completion.log"
    ) &

    local pid=$!
    ACTIVE_DOWNLOADS+=("$download_id:$pid")

    # Update state file
    update_state_file "add" "$download_id" "$url" "in_progress"
}

download_with_retry() {
    local url="$1"
    local target_dir="$2"
    local format="$3"
    local download_id="$4"
    local attempt=0
    local max_attempts=3

    while [[ $attempt -lt $max_attempts ]]; do
        attempt=$((attempt + 1))
        echo "[$download_id] Attempt $attempt/$max_attempts"

        # "$format" is passed straight to yt-dlp -f, so it must already be a
        # yt-dlp format selector (e.g. bestaudio) by the time it reaches here
        if yt-dlp -f "$format" \
            --output "$target_dir/%(title)s.%(ext)s" \
            --write-info-json \
            --write-thumbnail \
            --no-playlist \
            "$url" \
            2>&1 | tee "$SESSION_DIR/logs/$download_id.log"; then
            update_state_file "complete" "$download_id" "" "completed"
            return 0
        fi

        # Check error type
        local error_type=$(classify_error "$SESSION_DIR/logs/$download_id.log")

        if [[ "$error_type" == "disk_full" || "$error_type" == "permission_denied" ]]; then
            echo "FATAL: $error_type - aborting all downloads"
            pkill -P $$  # Kill all child processes
            exit 2
        fi

        if [[ $attempt -lt $max_attempts ]]; then
            local wait_time=$((5 * attempt))
            echo "Waiting ${wait_time}s before retry..."
            sleep "$wait_time"
        fi
    done

    update_state_file "fail" "$download_id" "" "failed"
    return 1
}

check_completed_downloads() {
    local new_active=()
    for entry in "${ACTIVE_DOWNLOADS[@]}"; do
        local pid="${entry##*:}"   # entry is "<download-id>:<pid>"
        if kill -0 "$pid" 2>/dev/null; then
            new_active+=("$entry")
        fi
    done
    ACTIVE_DOWNLOADS=("${new_active[@]}")
}
```

### 5. Track Progress

```bash
# Real-time progress monitoring
monitor_progress() {
    local session_dir="$1"

    while true; do
        # Load current state
        local state_file="$session_dir/state.json"
        local total=$(jq '.downloads | length' "$state_file")
        local completed=$(jq '[.downloads[] | select(.status == "completed")] | length' "$state_file")
        local in_progress=$(jq '[.downloads[] | select(.status == "in_progress")] | length' "$state_file")
        local failed=$(jq '[.downloads[] | select(.status == "failed")] | length' "$state_file")

        # Clear screen and display status
        clear
        cat <<EOF
ACQUISITION PROGRESS
====================

Session: $SESSION_ID
Started: $(jq -r '.started_at' "$state_file")
Status: $completed/$total completed ($failed failed, $in_progress in progress)

Active Downloads:
EOF

        # Show active download details
        jq -r '.downloads[] | select(.status == "in_progress") |
            "  [\(.progress_percent)%] \(.filename) @ \(.speed_mbps)MB/s (ETA: \(.eta_seconds)s)"' "$state_file"

        # Exit if all done
        if [[ $((completed + failed)) -eq $total ]]; then
            echo ""
            echo "All downloads completed."
            break
        fi

        sleep 5
    done
}

# Run in background
monitor_progress "$SESSION_DIR" &
MONITOR_PID=$!
```

### 6. Extract Audio (if requested)

```bash
if [[ "$EXTRACT_AUDIO" == "true" ]]; then
    echo "Extracting audio from video files..."

    find "$OUTPUT_DIR" -type d -name "video" | while read -r video_dir; do
        audio_dir="${video_dir%/video}/audio"
        mkdir -p "$audio_dir"

        find "$video_dir" -type f \( -name "*.mkv" -o -name "*.mp4" -o -name "*.webm" \) | while read -r video_file; do
            base_name=$(basename "$video_file" | sed 's/\.[^.]*$//')
            audio_file="$audio_dir/${base_name}.opus"

            echo "  $video_file -> $audio_file"

            # -nostdin keeps ffmpeg from consuming the file list the loop reads
            ffmpeg -nostdin -i "$video_file" \
                -vn \
                -acodec libopus \
                -b:a 128K \
                "$audio_file" \
                2>&1 | tee -a "$SESSION_DIR/logs/audio-extraction.log"

            if [[ ${PIPESTATUS[0]} -eq 0 ]]; then
                echo "SUCCESS: $audio_file"
            else
                echo "FAILED: $video_file"
            fi
        done
    done
fi
```

### 7. Verify Downloads (if requested)

```bash
if [[ "$VERIFY_AFTER" == "true" ]]; then
    echo "Verifying downloaded files..."

    verification_results="$SESSION_DIR/verification.json"
    echo '{"verified": [], "failed": []}' > "$verification_results"

    # Append a file path to the given key; --arg keeps quotes in names safe
    record_result() {
        jq --arg f "$2" ".$1 += [\$f]" "$verification_results" > "$verification_results.tmp"
        mv "$verification_results.tmp" "$verification_results"
    }

    find "$OUTPUT_DIR" -type f \( -name "*.opus" -o -name "*.mkv" -o -name "*.mp4" -o -name "*.flac" \) | while read -r file; do
        echo "Verifying: $file"

        # Check file size (GNU stat first, BSD stat as fallback)
        size=$(stat -c%s "$file" 2>/dev/null || stat -f%z "$file")
        if [[ $size -eq 0 ]]; then
            echo "FAILED: Zero-byte file"
            record_result failed "$file"
            continue
        fi

        # Check media integrity with ffmpeg: at "-v error", any output is a problem
        if [[ -n "$(ffmpeg -nostdin -v error -i "$file" -f null - 2>&1)" ]]; then
            echo "FAILED: Corrupted media file"
            record_result failed "$file"
            continue
        fi

        echo "VERIFIED: $file ($size bytes)"
        record_result verified "$file"
    done

    verified_count=$(jq '.verified | length' "$verification_results")
    failed_count=$(jq '.failed | length' "$verification_results")

    echo ""
    echo "Verification complete: $verified_count verified, $failed_count failed"
fi
```

### 8. Copy to Network Mount (if applicable)

```bash
if [[ -n "$LOCAL_WORK" && "$OUTPUT_DIR" != "$LOCAL_WORK"* ]]; then
    echo "Copying from local work directory to network mount..."

    # Batch copy with progress
    rsync -av --progress \
        --exclude='.DS_Store' \
        --exclude='Thumbs.db' \
        "$LOCAL_WORK/" "$OUTPUT_DIR/"

    # Verify checksums
    echo "Verifying network copy..."
    (cd "$LOCAL_WORK" && find . -type f -exec sha256sum {} \; | sort) > /tmp/local-checksums
    (cd "$OUTPUT_DIR" && find . -type f -exec sha256sum {} \; | sort) > /tmp/remote-checksums

    if diff /tmp/local-checksums /tmp/remote-checksums; then
        echo "Verification SUCCESS - removing local working copy"
        rm -rf "$LOCAL_WORK"
    else
        echo "WARNING: Checksum mismatch - keeping local copy at $LOCAL_WORK"
    fi
fi
```

## Progress Display Format

Real-time console output during acquisition:

```
ACQUISITION PROGRESS
====================

Session: 20260214-143022
Started: 2026-02-14T14:30:22Z
Status: 5/12 completed (1 failed, 3 in progress)

Active Downloads:
  [67%] concert-1.mkv @ 12.5MB/s (ETA: 245s)
  [23%] album.flac @ 8.2MB/s (ETA: 680s)
  [91%] podcast-ep5.opus @ 2.1MB/s (ETA: 45s)

Recent Completions:
  [✓] documentary.mp4 (1.2GB, 8m 34s)
  [✓] live-set.opus (85MB, 1m 12s)

Recent Failures:
  [✗] unavailable-video.mkv (Video unavailable - marked for alternate source)

Total Downloaded: 4.8GB / ~12.5GB estimated
Average Speed: 8.9MB/s
Estimated Completion: 14:58 (28 minutes remaining)
```

## Error Handling

### Common Errors and Responses

| Error | Detection | Response | User Action Required |
|-------|-----------|----------|---------------------|
| **URL inaccessible** | Pre-download validation fails | Skip URL, mark for manual review | Check URL, update source plan |
| **Network timeout** | Download stalls for 60s | Retry with increasing backoff (up to 3 attempts) | Check network connection |
| **Rate limited** | HTTP 429 response | Wait 60s, retry (2x) | Reduce parallel count |
| **Disk full** | Write fails with ENOSPC | STOP all downloads, escalate | Free disk space |
| **Format unavailable** | yt-dlp format error | Try fallback formats | Accept lower quality or skip |
| **Corrupted download** | ffmpeg verification fails | Delete and retry (2x) | Report to source |
| **Permission denied** | Write fails with EACCES | STOP, escalate | Fix permissions |
| **Mount failure** | Network mount timeout | Fall back to local-only mode | Check mount health |

### Error Log Structure

```bash
# Per-download error log: .curator/sessions/<session-id>/logs/<download-id>.log

[2026-02-14 14:35:22] Starting download: https://youtube.com/watch?v=abc123
[2026-02-14 14:35:23] Format selected: bestvideo[height<=1080]+bestaudio
[2026-02-14 14:37:45] ERROR: HTTP Error 429: Too Many Requests
[2026-02-14 14:37:45] Classified as: rate_limited
[2026-02-14 14:37:45] Applying retry strategy: wait 60s (attempt 1/2)
[2026-02-14 14:38:45] Retrying download...
[2026-02-14 14:42:10] Download completed: concert-1.mkv (1.2GB)
```

### Session Failure Recovery

```bash
# Detect stale sessions and offer recovery
detect_incomplete_sessions() {
    find .curator/sessions -name "state.json" -mtime -7 | while read -r state_file; do
        local status=$(jq -r '.status' "$state_file")
        local session_id=$(jq -r '.session_id' "$state_file")

        if [[ "$status" == "in_progress" ]]; then
            local completed=$(jq '[.downloads[] | select(.status == "completed")] | length' "$state_file")
            local total=$(jq '.downloads | length' "$state_file")

            echo "Incomplete session detected: $session_id ($completed/$total completed)"
            echo "Resume with: /acquire --resume $session_id"
        fi
    done
}
```

## Report Generation

### Final Session Report

```bash
generate_acquisition_report() {
    local session_dir="$1"
    local state_file="$session_dir/state.json"
    local report_file="$session_dir/report.md"

    cat > "$report_file" <<EOF
# Acquisition Session Report

**Session ID**: $(jq -r '.session_id' "$state_file")
**Started**: $(jq -r '.started_at' "$state_file")
**Completed**: $(date -Iseconds)
**Duration**: $(calculate_duration "$(jq -r '.started_at' "$state_file")" "$(date -Iseconds)")

## Summary

- **Total Downloads**: $(jq '.downloads | length' "$state_file")
- **Completed**: $(jq '[.downloads[] | select(.status == "completed")] | length' "$state_file")
- **Failed**: $(jq '[.downloads[] | select(.status == "failed")] | length' "$state_file")
- **Total Size**: $(jq '[.downloads[] | select(.status == "completed") | .filesize_bytes] | (add // 0) / 1073741824' "$state_file") GB

## Successful Downloads

$(jq -r '.downloads[] | select(.status == "completed") | "- \(.filename) (\(.filesize_bytes | tonumber / 1048576 | floor)MB)"' "$state_file")

## Failed Downloads

$(jq -r '.downloads[] | select(.status == "failed") | "- \(.url)\n  Error: \(.error)"' "$state_file")

## Parameters

- **Source Plan**: $(jq -r '.parameters.plan // "N/A"' "$state_file")
- **Format Preference**: $(jq -r '.parameters.format' "$state_file")
- **Output Directory**: $(jq -r '.parameters.output' "$state_file")
- **Parallel Downloads**: $(jq -r '.parameters.parallel' "$state_file")

## File Locations

- **Session Directory**: $session_dir
- **Download Logs**: $session_dir/logs/
- **Metadata**: $session_dir/metadata/
- **Verification Results**: $session_dir/verification.json
EOF

    echo "Report generated: $report_file"
    cat "$report_file"
}
```

### Metadata Export

```bash
# Export metadata for external tools
export_metadata() {
    local session_dir="$1"
    local export_file="$session_dir/metadata-export.json"

    jq '{
        session_id: .session_id,
        started_at: .started_at,
        downloads: [
            .downloads[] | select(.status == "completed") | {
                url: .url,
                filename: .filename,
                format: .format,
                filesize_bytes: .filesize_bytes,
                duration_seconds: .duration_seconds,
                checksum_sha256: .checksum_sha256
            }
        ]
    }' "$session_dir/state.json" > "$export_file"

    echo "Metadata exported: $export_file"
}
```

## Integration with Other Commands

### Typical Workflow

```bash
# 1. Discover sources
/find-sources --artist "Pink Floyd" --era "1970s" --sources youtube,archive

# 2. Acquire media from discovered sources
/acquire --plan .curator/sources/plan-001.yaml \
  --format video \
  --extract-audio \
  --verify-after \
  --output /mnt/archive/pink-floyd

# 3. Extract metadata (runs automatically or manually)
/extract-metadata --source /mnt/archive/pink-floyd/The_Wall/audio

# 4. Organize final collection
/organize --source /mnt/archive/pink-floyd
```

## Performance Considerations

### Network Mount Optimization

- **Download to local first**: Always use `--local-work` when the output is a network mount
- **Batch copy**: Single rsync after all downloads complete
- **Verify checksums**: Before deleting the local copy
- **Concurrent limit**: Reduce `--parallel` for network mounts (recommend 2)

### Disk Space Management

- **Pre-flight check**: Estimate total size from the source plan
- **Monitor during**: Watch for disk space warnings
- **Cleanup strategy**: Remove verified local copies after network copy

### Network Bandwidth

- **Parallel tuning**: More concurrent downloads for high bandwidth
- **Rate limiting**: Add `--limit-rate` to yt-dlp for shared networks
- **Peak hours**: Schedule large downloads for off-peak times

## Security Considerations

- **URL validation**: Never execute arbitrary commands from URLs
- **Directory traversal**: Sanitize all path components
- **Disk quota**: Respect user/system quotas
- **Network access**: Only connect to explicitly allowed domains
- **Credential handling**: Never log or expose API keys/tokens
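
The URL and path rules above can be sketched as two small pre-flight helpers. This is a minimal sketch under stated assumptions, not the command's actual implementation: `ALLOWED_DOMAINS`, `is_allowed_url`, and `sanitize_component` are hypothetical names, the domain list is purely illustrative, and the character class simply mirrors the `sed` expression used in the directory-creation step.

```bash
# Hypothetical allowlist (space-separated); replace with the domains your plans use.
ALLOWED_DOMAINS="youtube.com archive.org bandcamp.com"

# Succeeds only for http(s) URLs whose host is an allowed domain or a subdomain of one.
is_allowed_url() {
    local url="$1" authority host domain
    case "$url" in
        http://*|https://*) ;;
        *) return 1 ;;                      # reject other schemes outright
    esac
    authority="${url#*://}"                 # strip the scheme
    authority="${authority%%[/?#]*}"        # strip path, query, and fragment
    case "$authority" in
        ''|*@*) return 1 ;;                 # reject empty hosts and user@host tricks
    esac
    host="${authority%%:*}"                 # strip an optional port
    host=$(printf '%s' "$host" | tr 'A-Z' 'a-z')
    for domain in $ALLOWED_DOMAINS; do
        case "$host" in
            "$domain"|*."$domain") return 0 ;;
        esac
    done
    return 1
}

# Replaces every character outside [a-zA-Z0-9_-] so a component cannot contain
# "/" or "..", and falls back to "unknown" for empty input.
sanitize_component() {
    local safe
    safe=$(printf '%s' "$1" | sed 's/[^a-zA-Z0-9_-]/_/g')
    printf '%s\n' "${safe:-unknown}"
}

# Usage
is_allowed_url "https://youtube.com/watch?v=abc123" && echo "allowed"
sanitize_component "Pink Floyd"   # prints Pink_Floyd
```

Rejecting any authority that contains `@` is deliberately conservative: it blocks URLs such as `https://youtube.com:x@evil.example/`, where the real host is hidden behind a look-alike userinfo part.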