UNPKG

@thecodingwhale/cv-processor

Version:

CV Processor to extract structured data from PDF resumes using TypeScript

304 lines (228 loc) 8.88 kB
# CV Processor (TypeScript) A TypeScript/Node.js tool to extract structured data from CV/resume PDFs. ## Overview This tool processes PDF resumes/CVs and extracts structured information into JSON format, making it easier to analyze, search, and integrate CV data into applications. It's specifically designed for actor/actress resumes to extract credits and categorize them properly. ## Features - PDF text extraction and image processing for visual resume analysis - AI-powered extraction using multiple providers: - Google's Gemini AI - OpenAI (GPT-4, etc.) - Azure OpenAI - Grok (X.AI) - AWS Bedrock (Claude, Nova, etc.) - Organized output with categorized credits - CLI interface for easy use - Parallel processing of multiple AI providers - Performance metrics and processing time tracking - Reports analysis and provider comparison ## Installation ```bash # Clone the repository git clone <repository-url> cd cv-processor-ts # Install dependencies npm install # Build the project npm run build ``` ## Configuration To use the AI-powered features, you need to configure your API keys: 1. Create a `.env` file in the project root: ``` # Google Gemini API Key GEMINI_API_KEY=your_gemini_api_key_here # OpenAI API Key OPENAI_API_KEY=your_openai_api_key_here # Azure OpenAI Configuration AZURE_OPENAI_API_KEY=your_azure_openai_api_key_here AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com AZURE_OPENAI_API_VERSION=2024-04-01-preview AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment-name # Grok (X.AI) API Key GROK_API_KEY=your_grok_api_key_here # AWS Bedrock Configuration AWS_ACCESS_KEY_ID=your_aws_access_key_id AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key AWS_REGION=us-east-1 AWS_BEDROCK_INFERENCE_PROFILE_ARN=arn:aws:bedrock:us-east-1:123456789012:inference-profile/my-profile ``` ### Azure OpenAI and AWS Bedrock Setup For detailed setup instructions for Azure OpenAI and AWS Bedrock, please refer to the respective documentation. ## Customizing Instructions The application uses a text file for AI extraction instructions. You can customize these instructions by: 1. Editing the `instructions.txt` file in the project root directory 2. Or specifying a custom instructions file path when creating an AICVProcessor: ```typescript const processor = new AICVProcessor(aiProvider, { instructionsPath: '/path/to/your/custom-instructions.txt', verbose: true, }) ``` The instructions file contains: - The schema definition for extracted data - Categorization rules for actor credits - Extraction rules and guidelines - Examples of expected input/output ## Usage ### Command Line ```bash # Process a PDF resume with default AI (Gemini) npm start -- process path/to/resume.pdf # With verbose output npm start -- process path/to/resume.pdf -v # Specify output file npm start -- process path/to/resume.pdf -o output.json # Use OpenAI instead of Gemini npm start -- process path/to/resume.pdf --use-ai openai # Use Azure OpenAI npm start -- process path/to/resume.pdf --use-ai azure # Use Grok (X.AI) npm start -- process path/to/resume.pdf --use-ai grok # Use AWS Bedrock npm start -- process path/to/resume.pdf --use-ai aws npm start -- process path/to/resume.pdf --use-ai aws --ai-model anthropic.claude-3-sonnet-20240229-v1:0 # Specify a different AI model npm start -- process path/to/resume.pdf --ai-model gpt-4o npm start -- process path/to/resume.pdf --use-ai gemini --ai-model gemini-1.5-flash # Specify conversion type (PDF to Images or PDF to Text) npm start -- process path/to/resume.pdf --conversion-type pdftoimages npm start -- process path/to/resume.pdf --conversion-type pdftotexts # Specify custom instructions file path npm start -- process path/to/resume.pdf --instructions-path ./custom-instructions.txt # Specify expected total fields for emptiness percentage calculation npm start -- process path/to/resume.pdf --expected-total-fields 50 ``` ### Parallel Processing You can process a CV with multiple AI providers in parallel: ```bash # Process with all configured providers simultaneously npm run parallel path/to/resume.pdf # Process with all providers while specifying expected total fields npm run parallel path/to/resume.pdf --expected-total-fields 50 # Example with a real file path npm run parallel ./CVs/KRISTEEN-LY-castingnetworks.pdf --expected-total-fields 108 ``` When using the `--expected-total-fields` parameter, the system will calculate two emptiness percentages: 1. The default percentage based on AI-determined total fields 2. A percentage based on your specified expected total field count This will: 1. Run extractions using all configured AI providers/models in parallel 2. Save all results to an organized output directory 3. Generate a markdown report comparing performance and results 4. Track processing time for benchmarking purposes 5. Include both AI-determined and user-expected emptiness percentages in the report when `--expected-total-fields` is used The output will be saved to: `output/CVName_YYYY-MM-DD_HH-MM-SS/` ### Analyzing Results After running multiple CV processes, you can generate a merged report to compare AI provider performance: ```bash # Generate a merged report from all output directories npm start -- merge-reports # Specify a custom output directory npm start -- merge-reports -d ./my-output-folder # Specify a custom output file for the report npm start -- merge-reports -o performance-analysis.md ``` The merged report provides: 1. Rankings of AI providers by accuracy, speed, and combined performance 2. Detailed metrics for each provider and model 3. Recommendations for the best overall performer 4. Summary of all processing runs This helps identify which AI provider and model combination delivers the best results for your specific CV processing needs. ### API Usage ```typescript import { AIProviderFactory } from './dist/ai/AIProviderFactory' import { AICVProcessor } from './dist/AICVProcessor' const main = async () => { // Configure AI provider const aiConfig = { apiKey: process.env.GEMINI_API_KEY!, model: 'gemini-1.5-pro', } // Create AI provider and processor const aiProvider = AIProviderFactory.createProvider('gemini', aiConfig) const processor = new AICVProcessor(aiProvider, { verbose: true, // Optional: custom instructions path instructionsPath: './my-custom-instructions.txt', }) try { // Process the CV const cvData = await processor.processCv('path/to/resume.pdf') // Save to file processor.saveToJson(cvData, 'output.json') } catch (error) { console.error('Error processing CV:', error) } } main() ``` ## Output Format The processed CV is output as a JSON file with the following structure: ```json { "resume": [ { "category": "Film", "category_id": "a1b2c3d4-e5f6-g7h8-i9j0-k1l2m3n4o5p6", "credits": [ { "id": "b1c2d3e4-f5g6-h7i8-j9k0-l1m2n3o4p5q6", "year": "2023", "title": "Major Motion Picture", "role": "Supporting Character", "director": "Famous Director", "attached_media": [] } ] }, { "category": "Television", "category_id": "c1d2e3f4-g5h6-i7j8-k9l0-m1n2o3p4q5r6", "credits": [ { "id": "d1e2f3g4-h5i6-j7k8-l9m0-n1o2p3q4r5s6", "year": "2022", "title": "Popular TV Show", "role": "Guest Star", "director": "TV Director", "attached_media": [] } ] } ], "resume_show_years": true, "metadata": { "processedDate": "2023-07-01T12:34:56.789Z", "sourceFile": "actor_resume.pdf", "processingTime": 5.23, "provider": "gemini", "model": "gemini-1.5-pro" } } ``` ## AI Provider System The application is designed with a flexible AI provider system that allows you to easily swap between different AI models: 1. **Built-in Providers:** - Google Gemini AI (default) - OpenAI (GPT-4o, etc.) - Azure OpenAI (GPT-4o, etc.) - Grok (X.AI) API - AWS Bedrock (Amazon Nova, etc.) 2. **Performance Metrics:** - Each output includes processing time in seconds - Filenames include the processing time for easy comparison - Parallel processing generates reports comparing all providers - Merged reports identify the best providers based on accuracy and speed ## Dependencies - **@google/generative-ai**: Google Gemini AI integration - **openai**: OpenAI API integration - **pdf-parse**: PDF text extraction - **tesseract.js**: OCR capability - **@aws-sdk/client-bedrock-runtime**: AWS Bedrock integration - **commander**: CLI framework - **dotenv**: Environment variable management - **jsonrepair**: Fix malformed JSON from AI responses - **glob**: File path matching - **poppler-utils**: Required for PDF to image conversion (external dependency) ## License MIT