# Dealership Website Scraper - Automotive Market Intelligence Platform

## 🎯 **Project Overview**

A sophisticated, multi-platform automotive market intelligence system that automatically collects detailed vehicle inventory data from dealership websites across multiple platforms and geographic locations. The system combines web scraping automation with business intelligence capabilities for the automotive industry, featuring a template-based architecture that adapts to any dealership website while maintaining code quality and reliability.

## ❓ **Why (Business & Technical Drivers)**

### **Business Drivers**

- **Market Intelligence**: Provide real-time automotive inventory data for competitive analysis and market research
- **Automation**: Eliminate manual data collection from hundreds of dealership websites
- **Scalability**: Support 250+ dealership websites with minimal maintenance overhead
- **Data Quality**: Ensure consistent, structured data extraction across diverse website platforms
- **Compliance**: Respect robots.txt and implement ethical scraping practices

### **Technical Drivers**

- **Template-Based Architecture**: Enable rapid adaptation to new dealership platforms (<48h turnaround)
- **Multi-Platform Support**: Handle diverse website technologies (Dealer.com, Dealer Inspire, DealerOn, etc.)
- **Performance Optimization**: Prioritize fast JSON-LD extraction over heavy browser automation
- **Reliability**: Implement multi-level fallback systems for robust data collection
- **Historical Tracking**: Maintain complete inventory change history for trend analysis

## 🏗️ **Architecture & Technology Stack**

### **System Architecture**

**Dual-Component System**

1. **Backend Scraping Engine** - Python-based web scraper (requests + extruct for structured data, Playwright/Selenium as fallback; see the sketch after this list)
2. **Frontend Management Interface** - Modern Next.js web application for managing and monitoring scraping operations
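To make the extraction priority concrete, here is a minimal sketch of the backend fetch path, assuming the requests, extruct, and Playwright libraries named above. The function names and the "no Vehicle blocks found, so re-render" heuristic are illustrative assumptions, not a prescribed implementation; the `Product`/Microdata fallbacks described later in this document are omitted for brevity.

```python
import extruct
import requests
from playwright.sync_api import sync_playwright


def fetch_static_html(url: str) -> str:
    """Cheap path: plain HTTP GET, no browser involved."""
    resp = requests.get(url, timeout=30, headers={"User-Agent": "inventory-scraper/1.0"})
    resp.raise_for_status()
    return resp.text


def render_html(url: str) -> str:
    """Expensive fallback: let Playwright render JS-heavy pages."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


def extract_vehicle_jsonld(url: str) -> list[dict]:
    """Return JSON-LD Vehicle blocks, rendering the page only if static HTML has none."""
    html = fetch_static_html(url)
    blocks = extruct.extract(html, base_url=url, syntaxes=["json-ld"])["json-ld"]
    vehicles = [b for b in blocks if b.get("@type") == "Vehicle"]
    if not vehicles:  # structured data may only appear after client-side rendering
        html = render_html(url)
        blocks = extruct.extract(html, base_url=url, syntaxes=["json-ld"])["json-ld"]
        vehicles = [b for b in blocks if b.get("@type") == "Vehicle"]
    return vehicles
```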
### **Technology Stack**

- **Frontend**: Next.js 15.4.6, React 19.1.0, TypeScript, Tailwind CSS v4
- **Backend**: Python 3.x, Flask/FastAPI, Requests, Playwright (primary headless browser fallback), Selenium (specific cases where Playwright support is incomplete)
- **Database**: Supabase (PostgreSQL) for historical tracking and data persistence
- **Data Processing**: CSV + JSON output, regex parsing, location mapping, Supabase integration
- **UI Components**: Headless UI, Heroicons, React Hook Form, Zod validation
- **Infrastructure**: Docker containerization, scheduled execution (America/Chicago timezone)

---

## 🚀 **Core Features & Capabilities**

### **Primary Functionality**

- **Universal Dealership Scraping**: Scrapes multiple dealership websites simultaneously with platform detection
- **Template-Based Architecture**: Supports different dealership website platforms automatically through configurable templates
- **Intelligent Data Extraction**: Prioritizes structured data (JSON-LD) with fallback to DOM parsing
- **Real-Time Monitoring**: Admin portal for managing scraping operations and viewing results

### **Key Capabilities**

- **Multi-Website Support**: Handle 250+ dealership websites with parallel processing
- **Platform Detection**: Automatically identify website platforms using HTML/CDN/script markers
- **Data Validation**: VIN validation, price/mileage normalization, and data quality checks
- **Export Functionality**: Generate CSV and JSON exports for client use and analysis
- **Detection Scoring**: Comprehensive evaluation system to assess scraping confidence and reliability

### **Detection Capability**

Before scraping begins, the system performs a detection pass to determine the structure and schema of the dealership website. This detection process includes:

- **Page Type Identification**: Determining whether the page is an SRP (Search Results Page) or VDP (Vehicle Detail Page)
- **Platform Detection**: Identifying which dealership website platform is in use (e.g., DealerOn, Dealer.com, etc.)
- **Structured Data Detection**: Detecting the presence of structured metadata (e.g., JSON-LD Vehicle objects)
- **Field Mapping**: Mapping candidate selectors for key vehicle data fields (VIN, stock number, year, make, model, location)

#### **Detection Scoring Logic**

To ensure reliability, each site goes through a detection scoring system that evaluates:

- **Structured Data Presence** (e.g., JSON-LD, Microdata): +3 points
- **Platform Match Certainty** (based on layout patterns or classnames): +2 points
- **Field Coverage** (number of required fields confidently identified): +1 per field (up to 6 points)
- **VDP Link Discovery from SRP**: +2 points
- **Successful Test Parse** (single vehicle extraction test run): +2 points

**Max Score: 15**

- Sites scoring **13 or above** are marked as **high-confidence**
- Sites scoring **10–12** are marked as **partial-confidence**
- Sites below **10** are excluded unless manually reviewed

Scoring results are logged alongside the domain metadata for traceability.
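The rubric above translates directly into code. A minimal sketch follows; the dataclass shape and field names are illustrative assumptions, while the point values and tier thresholds come straight from the rules above.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    """Outcome of the pre-scrape detection pass for one site (illustrative shape)."""
    has_structured_data: bool    # JSON-LD or Microdata found
    platform_matched: bool       # layout patterns / classnames matched a known platform
    fields_identified: int       # required fields confidently mapped (0-6)
    vdp_links_found: bool        # VDP links discovered from the SRP
    test_parse_succeeded: bool   # single-vehicle extraction test run passed


def detection_score(r: DetectionResult) -> int:
    """Apply the scoring rubric; maximum possible score is 15."""
    score = 0
    score += 3 if r.has_structured_data else 0
    score += 2 if r.platform_matched else 0
    score += min(r.fields_identified, 6)  # +1 per field, capped at 6
    score += 2 if r.vdp_links_found else 0
    score += 2 if r.test_parse_succeeded else 0
    return score


def confidence_tier(score: int) -> str:
    """Map a score to the tiers defined above."""
    if score >= 13:
        return "high-confidence"
    if score >= 10:
        return "partial-confidence"
    return "excluded-pending-review"  # below 10: skip unless manually reviewed
```

A site with structured data, a matched platform, all six fields identified, discovered VDP links, and a passing test parse scores the full 15 points.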
### **Platform Support**

**Tier 1 (Ubiquitous)**

- **Dealer.com** - `dealer.com`, `dealer.net` domains
- **Dealer Inspire** - `dealerinspire.com` domains
- **DealerOn** - `dealeron.com` domains
- **Sincro (ex-Cobalt/CDK Digital)** - `sincro.com`, `cobalt.com` domains

**Tier 2 (Edge Cases)**

- **Reynolds & Reynolds** - `reyrey.com`, `nakedlime.com`
- **VinSolutions** - `vinsolutions.com`, `vinmanager.com`
- **CDK Global Fortellis APIs** - `cdk.com`, `fortellis.com` (if accessible via dealers; not a site template, but sometimes a data source)
- **Generic Template** - Unknown platforms (fallback)

---

## 📊 **Data & Processing Framework**

### **Core Data Entities**

- **Vehicle**: VIN, Year, Make, Model, Trim, Stock Number, Price, Mileage, Images, Dealer Location, URL
- **Dealership**: Name, Location (City, State, ZIP), Platform Template, Scraping Configuration
- **Scraping Session**: Timestamp, VIN Count, New/Removed VINs, Error Logs, Performance Metrics

### **Data Processing Model**

- **Input Sources**: Dealership websites, sitemap.xml, robots.txt, vehicle detail pages
- **Processing Logic**: Detection pass → platform detection → template selection → structured data extraction → fallback parsing → validation → storage
- **Output Formats**: JSON, CSV exports, Supabase database storage
- **Storage Strategy**: Historical tracking with `first_seen_at`, `last_seen_at`, and `active` flags

### **Data Validation & Quality**

- **Validation Rules**: VIN regex + check digit validation, year range checks, price/mileage normalization; the 10th VIN character encodes the model year (use as a fallback if the year is missing; see the validation sketch after the JSON-LD example below)
- **Error Handling**: Multi-level fallback system, graceful degradation for individual failures
- **Data Integrity**: Duplicate detection, historical change tracking, data consistency checks

### **Structured Data Extraction**

- Prioritize **JSON-LD `@type=Vehicle`** blocks
- Fallbacks: `Product` objects with VIN in `additionalProperty`, then Microdata, RDFa, Open Graph
- If structured data is missing: use platform-specific selectors

#### **Example JSON-LD Block**

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Vehicle",
  "vehicleIdentificationNumber": "1FTFW1EF1EFA00001",
  "brand": { "@type": "Brand", "name": "Ford" },
  "model": "F-150",
  "vehicleModelDate": "2021",
  "sku": "F12345",
  "offers": { "@type": "Offer", "price": "33995", "priceCurrency": "USD" },
  "mileage": { "@type": "QuantitativeValue", "value": "48210", "unitCode": "SMI" },
  "seller": {
    "@type": "Organization",
    "name": "Example Ford",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Austin",
      "addressRegion": "TX",
      "postalCode": "78701"
    }
  }
}
</script>
```
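To show how a block like the one above becomes a normalized record, here is a hedged sketch of the field mapping plus the VIN checks named under Data Validation & Quality. The `vehicle_record` output shape and helper names are assumptions; the check-digit tables follow the standard ISO 3779 scheme. Note that the sample VIN above is fabricated for illustration and is not guaranteed to pass a real check-digit test.

```python
import re

VIN_RE = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")  # 17 chars, I/O/Q excluded

# ISO 3779 check-digit tables: character values and per-position weights.
_VIN_VALUES = dict(zip("ABCDEFGH", range(1, 9)))
_VIN_VALUES.update(dict(zip("JKLMN", range(1, 6))))
_VIN_VALUES.update({"P": 7, "R": 9})
_VIN_VALUES.update(dict(zip("STUVWXYZ", range(2, 10))))
_VIN_VALUES.update({str(d): d for d in range(10)})
_VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

# 10th character -> model year for the 2010-2039 cycle (repeats every 30 years).
_YEAR_CODES = {c: 2010 + i for i, c in enumerate("ABCDEFGHJKLMNPRSTVWXY123456789")}


def vin_is_valid(vin: str) -> bool:
    """Regex shape check plus ISO 3779 check digit (9th character)."""
    if not VIN_RE.match(vin):
        return False
    total = sum(_VIN_VALUES[c] * w for c, w in zip(vin, _VIN_WEIGHTS))
    check = total % 11
    return vin[8] == ("X" if check == 10 else str(check))


def model_year_from_vin(vin: str) -> int | None:
    """Fallback year from the 10th VIN character (ambiguous across 30-year cycles)."""
    if len(vin) != 17:
        return None
    return _YEAR_CODES.get(vin[9])


def vehicle_record(block: dict) -> dict:
    """Map a JSON-LD Vehicle block (like the example above) to a flat record."""
    offers = block.get("offers", {})
    mileage = block.get("mileage", {})
    seller = block.get("seller", {}).get("address", {})
    vin = block.get("vehicleIdentificationNumber", "")
    year = block.get("vehicleModelDate") or model_year_from_vin(vin)
    return {
        "vin": vin,
        "vin_valid": vin_is_valid(vin),
        "year": int(year) if year else None,
        "make": block.get("brand", {}).get("name"),
        "model": block.get("model"),
        "stock_number": block.get("sku"),
        "price": float(offers["price"]) if offers.get("price") else None,
        "mileage": int(mileage["value"]) if mileage.get("value") else None,
        "dealer_city": seller.get("addressLocality"),
        "dealer_state": seller.get("addressRegion"),
        "dealer_zip": seller.get("postalCode"),
    }
```

Run over the example block, `vehicle_record` yields `{'vin': '1FTFW1EF1EFA00001', 'year': 2021, 'make': 'Ford', 'model': 'F-150', 'stock_number': 'F12345', 'price': 33995.0, 'mileage': 48210, ...}`; `vin_valid` comes back `False` because the illustrative VIN does not satisfy the check digit.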
---

## 🔍 **Core Workflows & Processes**

### **Primary Workflows**

1. **Dealership Onboarding**: URL submission → detection pass → platform detection → template assignment → initial crawl → validation
2. **Scheduled Scraping**: Daily execution (7 AM & 12:30 PM CT) → parallel processing → data extraction → storage → reporting
3. **Data Export**: Admin portal → filter selection → format selection → export generation → download

### **Process Hierarchy**

- **Priority 1**: JSON-LD structured data extraction (fastest, most reliable)
- **Priority 2**: Microdata and embedded JSON extraction (medium speed, good reliability)
- **Priority 3**: DOM selector parsing (slower, platform-specific)
- **Fallback**: Playwright/Selenium rendering (slowest, only for JS-heavy sites)

### **Reconnaissance & Platform Detection**

- Parse `sitemap.xml` and `robots.txt` to enumerate URLs
- Identify and classify listing pages vs. vehicle detail pages
- Detect platform using HTML/CDN/script markers, cookies, and footer credits
- Perform comprehensive detection scoring to assess scraping confidence
- Map candidate selectors and structured data availability before extraction

### **Scraping Hierarchy**

1. JSON-LD (`Vehicle` / `Product`)
2. Microdata
3. Embedded JSON (`window.__INITIAL_STATE__`)
4. DOM selectors (per platform template)
5. Playwright/Selenium rendering (last resort)

---

## 🛡️ **System Architecture & Reliability**

### **Core Components**

- **Scraping Engine**: Python backend with template-based extraction logic
- **Admin Portal**: Next.js frontend for management and monitoring
- **Database**: Supabase for data persistence and historical tracking
- **Scheduler**: Timezone-aware execution system (America/Chicago)

### **Integration Points**

- **External APIs**: Dealership websites, Supabase API
- **Internal Services**: Template engine, validation system, export generator
- **Data Flow**: URL discovery → platform detection → template selection → data extraction → validation → storage → reporting

### **Template-Based Architecture**

Each platform template defines:

- VDP discovery (inventory pages, pagination)
- Data extraction rules (selectors, JSON-LD schema expectations)
- Fallback strategies

### **Admin Portal**

- Manage dealership URLs
- Assign a platform template or force the generic fallback
- View last crawl results:
  - VIN count
  - New/removed VINs since last crawl
  - Diff view highlighting price/mileage changes alongside the new/removed VINs
  - Error logs per dealer
- Export CSV or JSON for client use

---

## 📈 **Performance & Scalability**

- **Performance Targets**:
  - Default: **HTTP + JSON-LD** (fast, low overhead)
  - Fallback: Playwright (primary headless browser) only when strictly required, Selenium in specific cases
  - Parallel processing with rate limits (1–2 rps per site)
- **Scaling Strategy**: Template-based architecture enables rapid platform addition
- **Resource Requirements**: Minimal for JSON-LD extraction, moderate for browser automation
- **Optimization Approach**: Prioritize structured data, minimize browser automation, parallel processing

---

## 🛡️ **Error Handling & Reliability**

- **Error Prevention**: Multi-level fallback system, comprehensive validation, rate limiting
- **Error Recovery**: Graceful degradation, individual dealer failure isolation, retry mechanisms
- **Monitoring & Alerting**: Errors per dealer logged, failures above threshold flagged in admin portal
- **Graceful Degradation**: Skip individual failed dealers, continue batch processing

---

## 📚 **Technical Requirements**

- **Development Environment**: Python 3.x, Node.js 18+, Docker
- **Runtime Environment**: Python backend, Next.js frontend, Supabase database
- **Dependencies**: Requests, Playwright (primary), Selenium (fallback), Next.js, React, Tailwind CSS
- **Configuration**: Environment variables for API keys, database connections, and scheduling (a scheduling sketch follows this list)
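One way to realize the twice-daily America/Chicago schedule is sketched below. APScheduler is an assumption (this document does not prescribe a scheduler library), as are the environment variable names and the batch entry point.

```python
import os

from apscheduler.schedulers.blocking import BlockingScheduler


def run_scrape_batch() -> None:
    """Placeholder for the real batch entry point (illustrative only)."""
    supabase_url = os.environ["SUPABASE_URL"]          # configuration via env vars,
    supabase_key = os.environ["SUPABASE_SERVICE_KEY"]  # as listed above (names assumed)
    ...


# Pin both daily runs to America/Chicago so DST shifts never move the wall-clock times.
scheduler = BlockingScheduler(timezone="America/Chicago")
scheduler.add_job(run_scrape_batch, "cron", hour=7, minute=0)
scheduler.add_job(run_scrape_batch, "cron", hour=12, minute=30)

if __name__ == "__main__":
    scheduler.start()
```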
### **Constraints & Limitations**

- **Technical Constraints**: Must respect robots.txt and apply reasonable rate limiting (see the sketch after this list)
- **Business Constraints**: Templates must be modular and easily updated (<48h turnaround)
- **Performance Constraints**: Scheduler pinned to **America/Chicago** time for the 7 AM & 12:30 PM runs
- **Security Constraints**: Secure API key management, rate limiting, ethical scraping practices
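A minimal sketch of honoring these constraints, using only the Python standard library plus requests; the helper names and module-level caches are illustrative, and the 0.5 s delay corresponds to the ~2 requests/second per-site ceiling in the performance targets above.

```python
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "inventory-scraper/1.0"
MIN_DELAY_SECONDS = 0.5  # ~2 requests/second per site

_robots: dict[str, RobotFileParser] = {}    # cached robots.txt parsers per domain
_last_hit: dict[str, float] = {}            # last request timestamp per domain


def polite_get(url: str) -> requests.Response | None:
    """GET that checks robots.txt and rate-limits per domain; None if disallowed."""
    domain = urlparse(url).netloc
    if domain not in _robots:
        rp = RobotFileParser()
        rp.set_url(urljoin(f"https://{domain}", "/robots.txt"))
        rp.read()  # fetches and parses the file
        _robots[domain] = rp
    if not _robots[domain].can_fetch(USER_AGENT, url):
        return None  # respect the disallow rule rather than working around it
    elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    _last_hit[domain] = time.monotonic()
    return requests.get(url, timeout=30, headers={"User-Agent": USER_AGENT})
```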
---

## 🔧 **Implementation Guidelines**

### **Code Quality Standards**

- **Coding Standards**: Python PEP 8, TypeScript strict mode, consistent naming conventions
- **Testing Requirements**: Unit tests for extraction logic, integration tests for end-to-end workflows
- **Documentation**: Comprehensive API documentation, template development guides
- **Code Review**: Mandatory review for template changes, automated testing for core logic

### **Development Workflow**

- **Version Control**: Git with feature branches, semantic versioning
- **Deployment**: Docker containerization, automated deployment pipeline
- **Monitoring**: Real-time scraping status, error logging, performance metrics
- **Maintenance**: Regular template updates, performance optimization, security updates

---

## 📋 **Project Success Criteria**

### **Functional Requirements**

- [ ] Successfully scrape dealership websites
- [ ] Extract all required vehicle data fields with 95%+ accuracy
- [ ] Support all major dealership platforms (Dealer.com, Dealer Inspire, DealerOn, Sincro)
- [ ] Provide real-time monitoring and management through the admin portal
- [ ] Generate accurate CSV and JSON exports for client use

### **Non-Functional Requirements**

- [ ] Achieve <2 second response time for admin portal operations (scraping runs complete within batch windows: minutes, not seconds)
- [ ] Maintain 99%+ uptime for scheduled scraping operations
- [ ] Scale to handle 50+ dealership websites
- [ ] Scale to handle 200+ dealership websites
- [ ] Scale to handle 500+ dealership websites
- [ ] Implement comprehensive error handling and recovery

### **Acceptance Criteria**

- [ ] All platform templates successfully extract vehicle data
- [ ] Admin portal provides complete scraping management capabilities
- [ ] Data export functionality works for all supported formats
- [ ] System maintains performance under load (250+ concurrent sites)
- [ ] Error handling prevents system failures from individual site issues

---

## 🚫 **Anti-Patterns & Pitfalls to Avoid**

### **Common Mistakes**

- **Over-reliance on browser automation**: Use Playwright (primary) and Selenium (fallback) only when absolutely necessary
- **Hard-coded selectors**: Always use the template-based approach for maintainability
- **Ignoring rate limits**: Respect robots.txt and implement proper delays
- **Single point of failure**: Implement fallback systems for all critical components

### **Design Anti-Patterns**

- **Monolithic scraping logic**: Use template-based architecture for platform flexibility
- **Synchronous processing**: Implement parallel processing with proper rate limiting
- **Poor error handling**: Implement comprehensive error handling and recovery
- **Inadequate validation**: Validate all extracted data before storage

---

## 📖 **Additional Resources & References**

### **Documentation**

- [Schema.org Vehicle Schema](https://schema.org/Vehicle) - Official vehicle data structure
- [Dealer.com Developer Resources](https://www.dealer.com/developers/) - Platform-specific documentation
- [Next.js Documentation](https://nextjs.org/docs) - Frontend framework documentation
- [Supabase Documentation](https://supabase.com/docs) - Database and API documentation

### **Examples & Patterns**

- [Web Scraping Best Practices](https://www.scraperapi.com/blog/web-scraping-best-practices/) - Ethical scraping guidelines
- [JSON-LD Implementation Examples](https://developers.google.com/search/docs/advanced/structured-data/intro-structured-data) - Structured data examples
- [Python Web Scraping Patterns](https://realpython.com/python-web-scraping-practical-introduction/) - Python scraping techniques

### **Tools & Services**

- [Playwright Documentation](https://playwright.dev/) - Browser automation framework
- [Extruct Library](https://github.com/scrapinghub/extruct) - Structured data extraction
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing library
- [Requests Library](https://requests.readthedocs.io/) - HTTP library for Python

---

## 🔄 **Template Usage Instructions**

### **How to Use This Template**

This INITIAL.md file serves as the foundation for the Dealership Website Scraper project and will be used by the PRP generation system to create detailed implementation plans.

### **Required Customizations**

- **Project Name**: Dealership Website Scraper - Automotive Market Intelligence Platform
- **Technology Stack**: Python backend, Next.js frontend, Supabase database
- **Features**: Multi-platform scraping, template-based architecture, real-time monitoring
- **Requirements**: Support 250+ dealership websites, maintain data quality, ensure reliability
- **Constraints**: Ethical scraping practices, performance optimization, rapid template updates

### **Next Steps**

1. Review and validate all sections for completeness
2. Use this file to generate detailed PRPs for implementation
3. Begin with core scraping engine development
4. Implement template-based architecture
5. Build admin portal and monitoring system

---

*This INITIAL.md file provides comprehensive project context for the Dealership Website Scraper system and enables the generation of detailed implementation PRPs.*