UNPKG

@agixbt/elizascraper

Version:

Awesome Scraper for eliza, scrape docs, tweets, and tokens

305 lines (247 loc) 6.48 kB
# Eliza Scraper A powerful scraping and RAG (Retrieval Augmented Generation) toolkit for collecting, processing, and serving blockchain ecosystem data. Built with Bun, TypeScript, and modern vector search capabilities. ## Features - 🤖 Automated Twitter data scraping with engagement filtering - 📊 Real-time token price and market data tracking via CoinGecko - 📑 Documentation site scraping with caching - 🔄 Configurable cron jobs for automated data collection - 🔍 Vector search with time-decay relevance scoring - 🚀 REST API with Swagger documentation ## Architecture ### RAG Pipeline for Tweet Processing ```mermaid graph TD A[Twitter] --> B[Apify Scraper] B --> C[Filters] C --> D[Text Processing] D --> E[OpenAi Embedding] E --> F[Qdrant Vector DB] F --> G[Search API] subgraph "filters" C --> H[Min Replies] C --> I[Min Retweets] C --> J[Search Tags] C --> K[Kaito Most Mindshare handles] end subgraph "Vector Search" F --> L[Semantic Search] end ``` ### System Architecture ```mermaid graph TB CLI[CLI Interface] --> Services API[REST API] --> Services subgraph "Services" Tweets[Tweet Service] Tokens[Token Service] Docs[Site Service] end subgraph "Storage" Qdrant[(Qdrant Vector DB)] Postgres[(PostgreSQL)] end Services --> Storage ``` ## Scraper Quick Start 1. Install globally: ```bash npm i @agixbt/elizascraper ``` 2. Create `.env` file: ```env # Required APIs APIFY_API_KEY=your_apify_key COINGECKO_API_KEY=your_coingecko_key OPENAI_API_KEY= # Database DATABASE_URL=postgresql://user:password@localhost:5432/token_scraper # Vector DB QDRANT_URL=http://localhost:6333 QDRANT_COLLECTION_NAME=twitter_embeddings # Optional TWITTER_SEARCH_TAGS=berachain,bera APP_TOPICS=berachain launch,berachain token ``` 3. Start required services: ```bash docker-compose up -d ``` ## CLI Commands ### Start API Server ```bash elizascraper server --port <number> ``` ### Tweet Scraping Service ```bash elizascraper tweets-cron [options] Options: --port Port number (default: 3000) --tags Custom search tags, can also pass twitter search strings, check out https://github.com/igorbrigadir/twitter-advanced-search (comma-separated) --start Start date (YYYY-MM-DD) --end End date (YYYY-MM-DD) --min-replies Minimum replies threshold (default: 4) --retweets Minimum retweets threshold (default: 2) --max-items Maximum tweets to fetch (default: 200) ``` Disclaimer: make sure you don't run the cron job in a small time interval, with start and end date less than 3 days, fetching less than 50 tweets, would lead to a ban shortly on your apify key. To prevent ban, same search query, ran multiple times inside an hour would be skipped over. ### Tracking Latest Token Data ```bash elizascraper token-cron [options] Options: -p, --port <number> Port to listen on (default: 3008) -i, --interval <number> Interval in minutes (default: 60) --currency <string> Currency to track (default: 'usd') --cg_category <string> CoinGecko category (default: 'berachain-ecosystem') ``` ### Documentation Scraping Service ```bash elizascraper site-cron [options] Options: -p, --port <number> Port to listen on (default: 3010) -i, --interval <number> Interval in minutes (default: 60) -l, --link <string> URL to scrape (default: 'https://docs.berachain.com') ``` ## API Documentation ### Tweets API #### GET /tweets/search Search for tweets based on semantic similarity. **Query Parameters:** - `searchString` (required): Text to search for semantically similar tweets **Response Schema:** ```typescript { status: "success" | "error", results: { tweet: { id: string, text: string, fullText: string, author: { name: string, userName: string, profilePicture: string }, createdAt: string, retweetCount: number, replyCount: number, likeCount: number }, combinedScore: number, originalScore: number, date: string }[] } ``` #### GET /tweets/random Get random tweets by predefined topics. **Response Schema:** ```typescript { topic: string, tweet: { tweet: { id: string, text: string, fullText: string, author: { name: string, userName: string, profilePicture: string }, createdAt: string, retweetCount: number, replyCount: number, likeCount: number }, combinedScore: number, originalScore: number, date: string }[] } ``` ### Tokens API #### GET /tokens Get token market data. **Query Parameters:** - `symbol` (optional): Token symbol to filter results **Response Schema:** ```typescript { status: "success" | "error", results: { id: string, symbol: string, name: string, image: string, currentPrice: number, marketCap: number, marketCapRank: number, fullyDilutedValuation: number, totalVolume: number, high24h: number, low24h: number, priceChange24h: number, priceChangePercentage24h: number, marketCapChange24h: number, marketCapChangePercentage24h: number, circulatingSupply: number, totalSupply: number, maxSupply: number, ath: number, athChangePercentage: number, athDate: string, atl: number, atlChangePercentage: number, atlDate: string, lastUpdated: string } } ``` ### Documentation API #### GET /docs/scrape Scrape and cache documentation from a given URL. **Query Parameters:** - `url` (required): URL of the documentation site to scrape **Response Schema:** ```typescript { status: "success" | "error", data: { title: string, last_updated: string, total_sections: number, sections: { topic: string, source_url: string, overview: string, subsections: { title: string, content: string }[] }[] }, source: "cache" | "fresh" } | { status: "error", message: string } ``` **Cache Behavior:** - Documentation is cached for 7 days - Returns cached version if available and not expired - Automatically refreshes cache if data is older than 7 days ### Error Handling All routes use a standardized error response format: ```typescript { status: "error", code: string, message: string } ``` ## Contributing Contributions are welcome! Please read our contributing guidelines before submitting pull requests. ## License MIT