root committed on
Commit
26e8660
·
1 Parent(s): bca907c
Files changed (6)
  1. README.md +249 -4
  2. config.py +182 -0
  3. requirements.txt +16 -2
  4. sample_resumes.csv +143 -0
  5. src/streamlit_app.py +730 -38
  6. test_installation.py +99 -0
README.md CHANGED
@@ -11,9 +11,254 @@ pinned: false
11
  short_description: Streamlit template space
12
  ---
13
 
14
- # Welcome to Streamlit!
15
 
16
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
17
 
18
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
19
- forums](https://discuss.streamlit.io).
14
+ # 🤖 AI Resume Screener
15
 
16
+ An advanced Streamlit application that automatically ranks candidate resumes against job descriptions using a sophisticated multi-stage AI pipeline.
17
 
18
+ ## 🚀 Features
19
+
20
+ ### Multi-Stage AI Pipeline
21
+ 1. **FAISS Recall**: Semantic similarity search using BGE embeddings (top 50 candidates)
22
+ 2. **Cross-Encoder Reranking**: Deep semantic matching using MS-Marco model (top 20 candidates)
23
+ 3. **BM25 Scoring**: Traditional keyword-based relevance scoring
24
+ 4. **Intent Analysis**: AI-powered candidate interest assessment using Qwen LLM
25
+ 5. **Final Ranking**: Weighted combination of all scores
26
+
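+ A minimal sketch of how the stages chain together, using the `ResumeScreener` methods from `src/streamlit_app.py` (assumes the models are loaded and `resume_texts`/`resume_filenames` are populated):
+
+ ```python
+ screener = ResumeScreener()
+
+ # Stage 1: semantic recall, Stage 2: cross-encoder rerank
+ top_50 = screener.faiss_recall(resume_texts, job_description, top_k=50)
+ top_20 = screener.cross_encoder_rerank(resume_texts, job_description, top_50, top_k=20)
+
+ # Or run all five stages (recall, rerank, BM25, intent, final ranking) in one call:
+ results_df = screener.advanced_pipeline_ranking(resume_texts, resume_filenames, job_description)
+ ```
+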
27
+ ### Advanced AI Models
28
+ - **Embedding Model**: BAAI/bge-large-en-v1.5 for semantic understanding
29
+ - **Cross-Encoder**: cross-encoder/ms-marco-MiniLM-L6-v2 for precise ranking
30
+ - **LLM**: Qwen2-1.5B with 4-bit quantization for intent analysis
31
+
32
+ ### Multiple Input Methods
33
+ - **File Upload**: PDF, DOCX, TXT files
34
+ - **CSV Upload**: Bulk resume processing
35
+ - **Hugging Face Datasets**: Direct integration with HF datasets
36
+
37
+ ### Comprehensive Analysis
38
+ - **Skills Extraction**: Technical skills and job-specific keywords
39
+ - **Score Breakdown**: Detailed analysis of each scoring component
40
+ - **Interactive Visualizations**: Charts and metrics for insights
41
+ - **Export Capabilities**: Download results as CSV
42
+
43
+ ## 📋 Requirements
44
+
45
+ ### System Requirements
46
+ - Python 3.8+
47
+ - CUDA-compatible GPU (recommended for optimal performance)
48
+ - 8GB+ RAM (16GB+ recommended)
49
+ - 10GB+ disk space for models
50
+
51
+ ### Dependencies
52
+ All dependencies are listed in `requirements.txt`:
53
+ - streamlit
54
+ - sentence-transformers
55
+ - transformers
56
+ - torch
57
+ - faiss-cpu
58
+ - rank-bm25
59
+ - nltk
60
+ - pdfplumber
61
+ - PyPDF2
62
+ - python-docx
63
+ - datasets
64
+ - plotly
65
+ - pandas
66
+ - numpy
+ - accelerate
+ - bitsandbytes
+ - altair
67
+
68
+ ## 🛠️ Installation
69
+
70
+ 1. **Clone the repository**:
71
+ ```bash
72
+ git clone <repository-url>
73
+ cd resumescreener_v2
74
+ ```
75
+
76
+ 2. **Install dependencies**:
77
+ ```bash
78
+ pip install -r requirements.txt
79
+ ```
80
+
81
+ 3. **Run the application**:
82
+ ```bash
83
+ streamlit run src/streamlit_app.py
84
+ ```
85
+
86
+ ## 📖 Usage Guide
87
+
88
+ ### Step 1: Model Loading
89
+ - Models are automatically loaded when the app starts
90
+ - First run may take 5-10 minutes to download models
91
+ - Check the sidebar for model loading status
92
+
93
+ ### Step 2: Job Description
94
+ - Enter the complete job description in the text area
95
+ - Include requirements, responsibilities, and desired skills
96
+ - More detailed descriptions yield better matching results
97
+
98
+ ### Step 3: Load Resumes
99
+ Choose from three options:
100
+
101
+ #### Option A: File Upload
102
+ - Upload PDF, DOCX, or TXT files
103
+ - Supports multiple file selection
104
+ - Automatic text extraction
105
+
106
+ #### Option B: CSV Upload
107
+ - Upload CSV with resume texts
108
+ - Select text and name columns
109
+ - Bulk processing capability
110
+
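+ A minimal CSV follows the layout of the bundled `sample_resumes.csv` (the row below is illustrative):
+
+ ```csv
+ name,resume_text
+ John Smith,"Senior Software Engineer with Python, Django, React and AWS experience..."
+ ```
+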
111
+ #### Option C: Hugging Face Dataset
112
+ - Load from public datasets
113
+ - Specify dataset name and columns
114
+ - Limited to 100 resumes for performance
115
+
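+ Under the hood this calls `datasets.load_dataset(dataset_name, split="train")` and keeps only the first 100 rows.
+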
116
+ ### Step 4: Run Pipeline
117
+ - Click "Run Advanced Ranking Pipeline"
118
+ - Monitor progress through 5 stages
119
+ - Results appear in three tabs
120
+
121
+ ### Step 5: Analyze Results
122
+
123
+ #### Summary Tab
124
+ - Top-ranked candidates table
125
+ - Key metrics and scores
126
+ - CSV download option
127
+
128
+ #### Detailed Analysis Tab
129
+ - Individual candidate breakdowns
130
+ - Score components explanation
131
+ - Skills and keywords analysis
132
+ - Resume excerpts
133
+
134
+ #### Visualizations Tab
135
+ - Score distribution charts
136
+ - Comparative analysis
137
+ - Intent distribution
138
+ - Average metrics
139
+
140
+ ## 🧮 Scoring Formula
141
+
142
+ **Final Score = 0.5 × Cross-Encoder + 0.3 × BM25 + 0.2 × Intent**
143
+
144
+ ### Score Components
145
+
146
+ 1. **Cross-Encoder Score (50%)**
147
+ - Deep semantic matching between job and resume
148
+ - Considers context and meaning
149
+ - Range: 0-1 (normalized)
150
+
151
+ 2. **BM25 Score (30%)**
152
+ - Traditional keyword-based relevance
153
+ - Based on term frequency and inverse document frequency
154
+ - Range: 0-1 (normalized)
155
+
156
+ 3. **Intent Score (20%)**
157
+ - AI-assessed candidate interest level
158
+ - Based on experience-job alignment
159
+ - Categories: Yes (0.9), Maybe (0.5), No (0.1)
160
+
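+ As a worked example: a candidate with normalized scores of 0.8 (cross-encoder) and 0.6 (BM25) whose intent is "Yes" (0.9) receives 0.5 × 0.8 + 0.3 × 0.6 + 0.2 × 0.9 = 0.40 + 0.18 + 0.18 = 0.76.
+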
161
+ ## 🎯 Best Practices
162
+
163
+ ### For Optimal Results
164
+ 1. **Detailed Job Descriptions**: Include specific requirements, technologies, and responsibilities
165
+ 2. **Quality Resume Data**: Ensure resumes contain relevant information
166
+ 3. **Appropriate Batch Size**: Process 20-100 resumes for best performance
167
+ 4. **Clear Requirements**: Specify must-have vs. nice-to-have skills
168
+
169
+ ### Performance Tips
170
+ 1. **GPU Usage**: Enable CUDA for faster processing
171
+ 2. **Memory Management**: Use cleanup controls for large batches
172
+ 3. **Model Caching**: Models are cached after first load
173
+ 4. **Batch Processing**: Process resumes in smaller batches if memory is limited
174
+
175
+ ## 🔧 Configuration
176
+
177
+ ### Model Configuration
178
+ Models can be customized by modifying the `load_models()` function:
179
+ - Change model names for different embeddings
180
+ - Adjust quantization settings
181
+ - Modify device mapping
182
+
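+ For example, to try a smaller embedding checkpoint (a sketch; any sentence-transformers model should work, `BAAI/bge-base-en-v1.5` is shown purely as an illustration):
+
+ ```python
+ # inside load_models(), swap the checkpoint name
+ st.session_state.embedding_model = SentenceTransformer('BAAI/bge-base-en-v1.5')
+ ```
+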
183
+ ### Scoring Weights
184
+ Adjust weights in `calculate_final_scores()`:
185
+ ```python
186
+ final_scores = 0.5 * ce_scores + 0.3 * bm25_scores + 0.2 * intent_scores
187
+ ```
188
+
189
+ ### Skills List
190
+ Customize the predefined skills list in the `ResumeScreener` class:
191
+ ```python
192
+ self.skills_list = [
193
+ 'python', 'java', 'javascript',
194
+ # Add your specific skills
195
+ ]
196
+ ```
197
+
198
+ ## 🐛 Troubleshooting
199
+
200
+ ### Common Issues
201
+
202
+ 1. **Model Loading Errors**
203
+ - Check internet connection for model downloads
204
+ - Ensure sufficient disk space
205
+ - Verify CUDA compatibility
206
+
207
+ 2. **Memory Issues**
208
+ - Reduce batch size
209
+ - Use CPU-only mode
210
+ - Clear cache between runs
211
+
212
+ 3. **File Processing Errors**
213
+ - Check file formats (PDF, DOCX, TXT)
214
+ - Ensure files are not corrupted
215
+ - Verify text extraction quality
216
+
217
+ 4. **Performance Issues**
218
+ - Enable GPU acceleration
219
+ - Process smaller batches
220
+ - Use model quantization
221
+
222
+ ### Error Messages
223
+ - **"Models not loaded"**: Wait for model loading to complete
224
+ - **"ML libraries not available"**: Install missing dependencies
225
+ - **"CUDA out of memory"**: Reduce batch size or use CPU
226
+
227
+ ## 📊 Sample Data
228
+
229
+ Use the included `sample_resumes.csv` for testing:
230
+ - 5 sample resumes with different roles
231
+ - Realistic job experience and skills
232
+ - Good for testing all features
233
+
234
+ ## 🤝 Contributing
235
+
236
+ 1. Fork the repository
237
+ 2. Create a feature branch
238
+ 3. Make your changes
239
+ 4. Add tests if applicable
240
+ 5. Submit a pull request
241
+
242
+ ## 📄 License
243
+
244
+ This project is licensed under the MIT License - see the LICENSE file for details.
245
+
246
+ ## 🙏 Acknowledgments
247
+
248
+ - **BAAI** for the BGE embedding model
249
+ - **Microsoft** for the MS-Marco cross-encoder
250
+ - **Alibaba** for the Qwen language model
251
+ - **Streamlit** for the web framework
252
+ - **Hugging Face** for model hosting and transformers library
253
+
254
+ ## 📞 Support
255
+
256
+ For issues and questions:
257
+ 1. Check the troubleshooting section
258
+ 2. Review error messages in the sidebar
259
+ 3. Open an issue on GitHub
260
+ 4. Check model compatibility
261
+
262
+ ---
263
+
264
+ **Built with ❤️ using Streamlit and state-of-the-art AI models**
config.py ADDED
@@ -0,0 +1,182 @@
1
+ """
2
+ Configuration file for AI Resume Screener
3
+ Modify these settings to customize the application behavior
4
+ """
5
+
6
+ # Model Configuration
7
+ MODELS = {
8
+ "embedding_model": "BAAI/bge-large-en-v1.5",
9
+ "cross_encoder": "cross-encoder/ms-marco-MiniLM-L6-v2",
10
+ "llm_model": "Qwen/Qwen2-1.5B", # Using smaller model for compatibility
11
+ }
12
+
13
+ # Pipeline Configuration
14
+ PIPELINE_CONFIG = {
15
+ "faiss_recall_top_k": 50,
16
+ "cross_encoder_top_k": 20,
17
+ "max_text_length": 8000,
18
+ "embedding_dimension": 1024,
19
+ }
20
+
21
+ # Scoring Weights (must sum to 1.0)
22
+ SCORING_WEIGHTS = {
23
+ "cross_encoder": 0.5,
24
+ "bm25": 0.3,
25
+ "intent": 0.2,
26
+ }
27
+
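+ # Illustrative sanity check (not required by the app): catches weight typos early
+ assert abs(sum(SCORING_WEIGHTS.values()) - 1.0) < 1e-9, "scoring weights must sum to 1.0"
+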
28
+ # Intent Analysis Configuration
29
+ INTENT_CONFIG = {
30
+ "max_prompt_length": 1024,
31
+ "max_new_tokens": 10,
32
+ "temperature": 0.1,
33
+ "intent_scores": {
34
+ "yes": 0.9,
35
+ "maybe": 0.5,
36
+ "no": 0.1,
37
+ }
38
+ }
39
+
40
+ # File Processing Configuration
41
+ FILE_CONFIG = {
42
+ "supported_formats": ["pdf", "docx", "txt", "csv"],
43
+ "max_file_size_mb": 10,
44
+ "max_files_per_upload": 50,
45
+ }
46
+
47
+ # UI Configuration
48
+ UI_CONFIG = {
49
+ "page_title": "🤖 AI Resume Screener",
50
+ "page_icon": "🤖",
51
+ "layout": "wide",
52
+ "sidebar_state": "expanded",
53
+ "max_display_resumes": 100,
54
+ }
55
+
56
+ # Performance Configuration
57
+ PERFORMANCE_CONFIG = {
58
+ "use_gpu": True,
59
+ "quantization": True,
60
+ "batch_size": 32,
61
+ "cache_models": True,
62
+ }
63
+
64
+ # Skills Database
65
+ TECHNICAL_SKILLS = [
66
+ # Programming Languages
67
+ 'python', 'java', 'javascript', 'typescript', 'c++', 'c#', 'go', 'rust',
68
+ 'scala', 'r', 'matlab', 'php', 'ruby', 'swift', 'kotlin', 'dart',
69
+
70
+ # Web Technologies
71
+ 'html', 'css', 'react', 'angular', 'vue', 'node.js', 'express', 'django',
72
+ 'flask', 'fastapi', 'spring', 'laravel', 'bootstrap', 'tailwind',
73
+
74
+ # Databases
75
+ 'sql', 'mongodb', 'postgresql', 'mysql', 'redis', 'elasticsearch',
76
+ 'cassandra', 'dynamodb', 'sqlite', 'oracle',
77
+
78
+ # Cloud & DevOps
79
+ 'aws', 'azure', 'gcp', 'docker', 'kubernetes', 'terraform', 'ansible',
80
+ 'jenkins', 'gitlab', 'github', 'ci/cd', 'devops', 'microservices',
81
+
82
+ # Data Science & ML
83
+ 'machine learning', 'deep learning', 'tensorflow', 'pytorch', 'keras',
84
+ 'scikit-learn', 'pandas', 'numpy', 'matplotlib', 'plotly', 'seaborn',
85
+ 'jupyter', 'spark', 'hadoop', 'kafka', 'airflow',
86
+
87
+ # Analytics & BI
88
+ 'tableau', 'powerbi', 'excel', 'google analytics', 'mixpanel', 'amplitude',
89
+ 'looker', 'qlik', 'sas', 'spss', 'stata',
90
+
91
+ # Operating Systems & Tools
92
+ 'linux', 'ubuntu', 'centos', 'windows', 'macos', 'bash', 'powershell',
93
+ 'git', 'vim', 'vscode', 'intellij', 'eclipse',
94
+
95
+ # Methodologies
96
+ 'agile', 'scrum', 'kanban', 'lean', 'waterfall', 'tdd', 'bdd',
97
+
98
+ # Networking & Security
99
+ 'tcp/ip', 'http', 'https', 'ssl', 'oauth', 'jwt', 'api', 'rest', 'graphql',
100
+ 'nginx', 'apache', 'load balancing', 'vpn', 'firewall',
101
+ ]
102
+
103
+ # Job Categories for Enhanced Matching
104
+ JOB_CATEGORIES = {
105
+ "software_engineer": [
106
+ "programming", "coding", "development", "software", "engineer", "developer"
107
+ ],
108
+ "data_scientist": [
109
+ "data", "analytics", "machine learning", "statistics", "modeling", "scientist"
110
+ ],
111
+ "devops_engineer": [
112
+ "devops", "infrastructure", "deployment", "automation", "cloud", "operations"
113
+ ],
114
+ "product_manager": [
115
+ "product", "manager", "strategy", "roadmap", "requirements", "stakeholder"
116
+ ],
117
+ "designer": [
118
+ "design", "ui", "ux", "user experience", "interface", "visual", "creative"
119
+ ],
120
+ "marketing": [
121
+ "marketing", "campaign", "brand", "social media", "content", "seo", "sem"
122
+ ],
123
+ "sales": [
124
+ "sales", "business development", "account", "revenue", "client", "customer"
125
+ ]
126
+ }
127
+
128
+ # Default Job Description Template
129
+ DEFAULT_JOB_DESCRIPTION = """
130
+ Software Engineer - Full Stack Development
131
+
132
+ We are looking for a talented Software Engineer to join our growing team.
133
+
134
+ Requirements:
135
+ - 3+ years of experience in software development
136
+ - Proficiency in Python, JavaScript, and SQL
137
+ - Experience with React and Node.js
138
+ - Knowledge of cloud platforms (AWS, Azure, or GCP)
139
+ - Familiarity with Docker and CI/CD pipelines
140
+ - Strong problem-solving and communication skills
141
+
142
+ Responsibilities:
143
+ - Develop and maintain web applications
144
+ - Collaborate with cross-functional teams
145
+ - Write clean, maintainable code
146
+ - Participate in code reviews
147
+ - Contribute to technical architecture decisions
148
+
149
+ Nice to have:
150
+ - Experience with machine learning
151
+ - Knowledge of microservices architecture
152
+ - DevOps experience
153
+ - Open source contributions
154
+ """
155
+
156
+ # Error Messages
157
+ ERROR_MESSAGES = {
158
+ "models_not_loaded": "❌ AI models are still loading. Please wait...",
159
+ "no_job_description": "❌ Please enter a job description",
160
+ "no_resumes": "❌ Please load some resumes first",
161
+ "file_processing_error": "❌ Error processing file: {filename}",
162
+ "model_loading_error": "❌ Error loading model: {model_name}",
163
+ "pipeline_error": "❌ Error in pipeline stage: {stage}",
164
+ }
165
+
166
+ # Success Messages
167
+ SUCCESS_MESSAGES = {
168
+ "models_loaded": "✅ All AI models loaded successfully!",
169
+ "files_processed": "✅ Processed {count} resume files",
170
+ "pipeline_complete": "✅ Resume screening pipeline completed!",
171
+ "results_exported": "✅ Results exported successfully",
172
+ }
173
+
174
+ # Validation Rules
175
+ VALIDATION_RULES = {
176
+ "min_job_description_length": 50,
177
+ "max_job_description_length": 10000,
178
+ "min_resume_length": 20,
179
+ "max_resume_length": 20000,
180
+ "min_resumes_for_ranking": 1,
181
+ "max_resumes_for_ranking": 1000,
182
+ }
requirements.txt CHANGED
@@ -1,3 +1,17 @@
1
- altair
2
  pandas
3
- streamlit
1
+ streamlit
2
  pandas
3
+ numpy
4
+ sentence-transformers
5
+ transformers
6
+ torch
7
+ accelerate
8
+ bitsandbytes
9
+ faiss-cpu
10
+ rank-bm25
11
+ nltk
12
+ pdfplumber
13
+ PyPDF2
14
+ python-docx
15
+ datasets
16
+ plotly
17
+ altair
sample_resumes.csv ADDED
@@ -0,0 +1,143 @@
1
+ name,resume_text
2
+ John Smith,"John Smith
3
+ Software Engineer
4
5
+ Phone: (555) 123-4567
6
+
7
+ EXPERIENCE
8
+ Senior Software Engineer | TechCorp | 2020-2023
9
+ - Developed scalable web applications using Python, Django, and React
10
+ - Led a team of 5 developers in building microservices architecture
11
+ - Implemented CI/CD pipelines using Jenkins and Docker
12
+ - Worked with AWS services including EC2, S3, and RDS
13
+
14
+ Software Developer | StartupXYZ | 2018-2020
15
+ - Built REST APIs using Flask and PostgreSQL
16
+ - Developed frontend components using JavaScript and Vue.js
17
+ - Collaborated with cross-functional teams using Agile methodology
18
+
19
+ EDUCATION
20
+ Bachelor of Science in Computer Science | University of Technology | 2018
21
+
22
+ SKILLS
23
+ Programming: Python, JavaScript, Java, SQL
24
+ Frameworks: Django, Flask, React, Vue.js
25
+ Databases: PostgreSQL, MySQL, MongoDB
26
+ Cloud: AWS, Docker, Kubernetes
27
+ Tools: Git, Jenkins, JIRA"
28
+
29
+ Sarah Johnson,"Sarah Johnson
30
+ Data Scientist
31
32
+ Phone: (555) 987-6543
33
+
34
+ EXPERIENCE
35
+ Senior Data Scientist | DataTech Solutions | 2021-2023
36
+ - Developed machine learning models using Python, scikit-learn, and TensorFlow
37
+ - Built predictive analytics solutions for customer behavior analysis
38
+ - Created data pipelines using Apache Spark and Kafka
39
+ - Deployed models to production using MLOps practices
40
+
41
+ Data Analyst | Analytics Inc | 2019-2021
42
+ - Performed statistical analysis using R and Python
43
+ - Created interactive dashboards using Tableau and PowerBI
44
+ - Worked with large datasets using SQL and Pandas
45
+ - Collaborated with business stakeholders to define KPIs
46
+
47
+ EDUCATION
48
+ Master of Science in Data Science | Data University | 2019
49
+ Bachelor of Science in Statistics | Math College | 2017
50
+
51
+ SKILLS
52
+ Programming: Python, R, SQL, Scala
53
+ ML/AI: scikit-learn, TensorFlow, PyTorch, Keras
54
+ Big Data: Spark, Hadoop, Kafka
55
+ Visualization: Tableau, PowerBI, Matplotlib, Plotly
56
+ Statistics: Hypothesis testing, A/B testing, Regression analysis"
57
+
58
+ Mike Chen,"Mike Chen
59
+ DevOps Engineer
60
61
+ Phone: (555) 456-7890
62
+
63
+ EXPERIENCE
64
+ DevOps Engineer | CloudFirst | 2020-2023
65
+ - Managed AWS infrastructure using Terraform and CloudFormation
66
+ - Implemented monitoring and alerting using Prometheus and Grafana
67
+ - Automated deployment processes using Jenkins and GitLab CI
68
+ - Maintained Kubernetes clusters and Docker containers
69
+
70
+ System Administrator | TechServices | 2018-2020
71
+ - Administered Linux servers and network infrastructure
72
+ - Implemented backup and disaster recovery solutions
73
+ - Managed database systems including MySQL and PostgreSQL
74
+ - Provided technical support and troubleshooting
75
+
76
+ EDUCATION
77
+ Bachelor of Science in Information Technology | Tech Institute | 2018
78
+
79
+ SKILLS
80
+ Cloud Platforms: AWS, Azure, GCP
81
+ Infrastructure: Terraform, CloudFormation, Ansible
82
+ Containers: Docker, Kubernetes, OpenShift
83
+ Monitoring: Prometheus, Grafana, ELK Stack
84
+ Operating Systems: Linux, Ubuntu, CentOS
85
+ Scripting: Bash, Python, PowerShell"
86
+
87
+ Lisa Wang,"Lisa Wang
88
+ Frontend Developer
89
90
+ Phone: (555) 321-0987
91
+
92
+ EXPERIENCE
93
+ Senior Frontend Developer | WebSolutions | 2021-2023
94
+ - Developed responsive web applications using React and TypeScript
95
+ - Implemented modern CSS frameworks including Tailwind and Bootstrap
96
+ - Optimized application performance and user experience
97
+ - Collaborated with UX/UI designers and backend developers
98
+
99
+ Frontend Developer | DigitalAgency | 2019-2021
100
+ - Built interactive user interfaces using Angular and JavaScript
101
+ - Created mobile-responsive designs using HTML5 and CSS3
102
+ - Integrated frontend applications with REST APIs
103
+ - Participated in code reviews and agile development processes
104
+
105
+ EDUCATION
106
+ Bachelor of Arts in Web Design | Design College | 2019
107
+
108
+ SKILLS
109
+ Languages: JavaScript, TypeScript, HTML5, CSS3
110
+ Frameworks: React, Angular, Vue.js
111
+ Styling: Tailwind CSS, Bootstrap, Sass, Less
112
+ Tools: Webpack, Vite, npm, yarn
113
+ Version Control: Git, GitHub, GitLab
114
+ Testing: Jest, Cypress, React Testing Library"
115
+
116
+ Robert Brown,"Robert Brown
117
+ Product Manager
118
119
+ Phone: (555) 654-3210
120
+
121
+ EXPERIENCE
122
+ Senior Product Manager | InnovateTech | 2020-2023
123
+ - Led product strategy and roadmap for B2B SaaS platform
124
+ - Managed cross-functional teams of 15+ engineers and designers
125
+ - Conducted market research and competitive analysis
126
+ - Defined product requirements and user stories using Agile methodology
127
+
128
+ Product Manager | StartupHub | 2018-2020
129
+ - Launched 3 new product features resulting in 25% user growth
130
+ - Collaborated with engineering teams to prioritize development tasks
131
+ - Analyzed user feedback and metrics to drive product decisions
132
+ - Coordinated go-to-market strategies with marketing and sales teams
133
+
134
+ EDUCATION
135
+ MBA in Business Administration | Business School | 2018
136
+ Bachelor of Science in Engineering | Engineering University | 2016
137
+
138
+ SKILLS
139
+ Product Management: Roadmapping, User Research, A/B Testing
140
+ Analytics: Google Analytics, Mixpanel, Amplitude
141
+ Project Management: JIRA, Asana, Trello
142
+ Methodologies: Agile, Scrum, Lean Startup
143
+ Communication: Stakeholder Management, Presentation Skills"
src/streamlit_app.py CHANGED
@@ -1,40 +1,732 @@
1
- import altair as alt
2
- import numpy as np
3
- import pandas as pd
4
  import streamlit as st
5
 
6
- """
7
- # Welcome to Streamlit!
8
-
9
- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
10
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
11
- forums](https://discuss.streamlit.io).
12
-
13
- In the meantime, below is an example of what you can do with just a few lines of code:
14
- """
15
-
16
- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
17
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
18
-
19
- indices = np.linspace(0, 1, num_points)
20
- theta = 2 * np.pi * num_turns * indices
21
- radius = indices
22
-
23
- x = radius * np.cos(theta)
24
- y = radius * np.sin(theta)
25
-
26
- df = pd.DataFrame({
27
- "x": x,
28
- "y": y,
29
- "idx": indices,
30
- "rand": np.random.randn(num_points),
31
- })
32
-
33
- st.altair_chart(alt.Chart(df, height=700, width=700)
34
- .mark_point(filled=True)
35
- .encode(
36
- x=alt.X("x", axis=None),
37
- y=alt.Y("y", axis=None),
38
- color=alt.Color("idx", legend=None, scale=alt.Scale()),
39
- size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
40
- ))
1
  import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+ import plotly.express as px
5
+ import plotly.graph_objects as go
6
+ from io import BytesIO
7
+ import base64
8
+ import os
9
+ import re
10
+ import warnings
11
+ warnings.filterwarnings("ignore")
12
+
13
+ # ML/NLP imports
14
+ try:
15
+ from sentence_transformers import SentenceTransformer, CrossEncoder
16
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
17
+ import torch
18
+ import faiss
19
+ from rank_bm25 import BM25Okapi
20
+ import nltk
21
+ from nltk.tokenize import word_tokenize
22
+ import pdfplumber
23
+ import PyPDF2
24
+ from docx import Document
25
+ from datasets import load_dataset
26
+ ML_IMPORTS_AVAILABLE = True
27
+ except ImportError as e:
28
+ st.error(f"Missing required ML libraries: {e}")
29
+ ML_IMPORTS_AVAILABLE = False
30
+
31
+ # Download NLTK data
32
+ try:
33
+ nltk.download('punkt', quiet=True)
34
+ nltk.download('stopwords', quiet=True)
35
+ except Exception:
36
+ pass
37
+
38
+ # Page configuration
39
+ st.set_page_config(
40
+ page_title="🤖 AI Resume Screener",
41
+ page_icon="🤖",
42
+ layout="wide",
43
+ initial_sidebar_state="expanded"
44
+ )
45
+
46
+ # Initialize session state
47
+ if 'models_loaded' not in st.session_state:
48
+ st.session_state.models_loaded = False
49
+ if 'embedding_model' not in st.session_state:
50
+ st.session_state.embedding_model = None
51
+ if 'cross_encoder' not in st.session_state:
52
+ st.session_state.cross_encoder = None
53
+ if 'llm_tokenizer' not in st.session_state:
54
+ st.session_state.llm_tokenizer = None
55
+ if 'llm_model' not in st.session_state:
56
+ st.session_state.llm_model = None
57
+ if 'model_errors' not in st.session_state:
58
+ st.session_state.model_errors = {}
59
+ if 'resume_texts' not in st.session_state:
60
+ st.session_state.resume_texts = []
61
+ if 'resume_filenames' not in st.session_state:
62
+ st.session_state.resume_filenames = []
63
+ if 'results' not in st.session_state:
64
+ st.session_state.results = None
65
+
66
+ def load_models():
67
+ """Load all ML models at startup"""
68
+ if st.session_state.models_loaded:
69
+ return
70
+
71
+ st.info("🔄 Loading AI models... This may take a few minutes on first run.")
72
+
73
+ # Load embedding model
74
+ try:
75
+ print("Loading embedding model: BAAI/bge-large-en-v1.5")
76
+ st.text("Loading embedding model...")
77
+ try:
78
+ st.session_state.embedding_model = SentenceTransformer(
79
+ 'BAAI/bge-large-en-v1.5',
80
+ device_map="auto"
81
+ )
82
+ except Exception as e:
83
+ print(f"Device map failed, falling back to default: {e}")
84
+ st.session_state.embedding_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
85
+ print("✅ Embedding model loaded successfully")
86
+ except Exception as e:
87
+ print(f"❌ Error loading embedding model: {e}")
88
+ st.session_state.model_errors['embedding'] = str(e)
89
+
90
+ # Load cross-encoder
91
+ try:
92
+ print("Loading cross-encoder: cross-encoder/ms-marco-MiniLM-L6-v2")
93
+ st.text("Loading cross-encoder...")
94
+ st.session_state.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')
95
+ print("✅ Cross-encoder loaded successfully")
96
+ except Exception as e:
97
+ print(f"❌ Error loading cross-encoder: {e}")
98
+ st.session_state.model_errors['cross_encoder'] = str(e)
99
+
100
+ # Load LLM for intent analysis
101
+ try:
102
+ print("Loading LLM: Qwen/Qwen2-1.5B") # Using smaller model for better compatibility
103
+ st.text("Loading LLM for intent analysis...")
104
+
105
+ # Quantization config
106
+ bnb_config = BitsAndBytesConfig(
107
+ load_in_4bit=True,
108
+ bnb_4bit_use_double_quant=True,
109
+ bnb_4bit_quant_type="nf4",
110
+ bnb_4bit_compute_dtype=torch.bfloat16
111
+ )
112
+
113
+ st.session_state.llm_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
114
+ st.session_state.llm_model = AutoModelForCausalLM.from_pretrained(
115
+ "Qwen/Qwen2-1.5B",
116
+ quantization_config=bnb_config,
117
+ device_map="auto",
118
+ trust_remote_code=True
119
+ )
120
+ print("✅ LLM loaded successfully")
121
+ except Exception as e:
122
+ print(f"❌ Error loading LLM: {e}")
123
+ st.session_state.model_errors['llm'] = str(e)
124
+
125
+ st.session_state.models_loaded = True
126
+ st.success("✅ All models loaded successfully!")
127
+
128
+ class ResumeScreener:
129
+ def __init__(self):
130
+ self.embedding_model = st.session_state.embedding_model
131
+ self.cross_encoder = st.session_state.cross_encoder
132
+ self.llm_tokenizer = st.session_state.llm_tokenizer
133
+ self.llm_model = st.session_state.llm_model
134
+
135
+ # Predefined skills list
136
+ self.skills_list = [
137
+ 'python', 'java', 'javascript', 'react', 'angular', 'vue', 'node.js',
138
+ 'sql', 'mongodb', 'postgresql', 'mysql', 'aws', 'azure', 'gcp',
139
+ 'docker', 'kubernetes', 'git', 'machine learning', 'deep learning',
140
+ 'tensorflow', 'pytorch', 'scikit-learn', 'pandas', 'numpy',
141
+ 'html', 'css', 'bootstrap', 'tailwind', 'api', 'rest', 'graphql',
142
+ 'microservices', 'agile', 'scrum', 'devops', 'ci/cd', 'jenkins',
143
+ 'linux', 'bash', 'shell scripting', 'data analysis', 'statistics',
144
+ 'excel', 'powerbi', 'tableau', 'spark', 'hadoop', 'kafka',
145
+ 'redis', 'elasticsearch', 'nginx', 'apache', 'django', 'flask',
146
+ 'spring', 'express', 'fastapi', 'laravel', 'php', 'c++', 'c#',
147
+ 'go', 'rust', 'scala', 'r', 'matlab', 'sas', 'spss'
148
+ ]
149
+
150
+ def extract_text_from_file(self, file):
151
+ """Extract text from uploaded files"""
152
+ try:
153
+ if file.type == "application/pdf":
154
+ # Try pdfplumber first
155
+ try:
156
+ with pdfplumber.open(file) as pdf:
157
+ text = ""
158
+ for page in pdf.pages:
159
+ text += page.extract_text() or ""
160
+ return text
161
+ except Exception:
162
+ # Fallback to PyPDF2
163
+ file.seek(0)
164
+ reader = PyPDF2.PdfReader(file)
165
+ text = ""
166
+ for page in reader.pages:
167
+ text += page.extract_text()
168
+ return text
169
+
170
+ elif file.type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
171
+ doc = Document(file)
172
+ text = ""
173
+ for paragraph in doc.paragraphs:
174
+ text += paragraph.text + "\n"
175
+ return text
176
+
177
+ elif file.type == "text/plain":
178
+ return str(file.read(), "utf-8")
179
+
180
+ elif file.type == "text/csv":
181
+ df = pd.read_csv(file)
182
+ return df.to_string()
183
+
184
+ else:
185
+ return "Unsupported file type"
186
+
187
+ except Exception as e:
188
+ st.warning(f"Error extracting text from {file.name}: {str(e)}")
189
+ return ""
190
+
191
+ def get_embedding(self, text):
192
+ """Get embedding for text"""
193
+ if not self.embedding_model:
194
+ return None
195
+
196
+ if not text or len(text.strip()) == 0:
197
+ return np.zeros(1024) # Default embedding size for BGE
198
+
199
+ # Truncate if too long
200
+ if len(text) > 8000:
201
+ text = text[:8000]
202
+
203
+ try:
204
+ embedding = self.embedding_model.encode(text, normalize_embeddings=True)
205
+ return embedding
206
+ except Exception as e:
207
+ st.warning(f"Error getting embedding: {e}")
208
+ return np.zeros(1024)
209
+
210
+ def calculate_bm25_scores(self, resume_texts, job_description):
211
+ """Calculate BM25 scores"""
212
+ try:
213
+ # Tokenize documents
214
+ tokenized_resumes = [word_tokenize(text.lower()) for text in resume_texts]
215
+ tokenized_job = word_tokenize(job_description.lower())
216
+
217
+ # Create BM25 object
218
+ bm25 = BM25Okapi(tokenized_resumes)
219
+
220
+ # Get scores
221
+ scores = bm25.get_scores(tokenized_job)
222
+ return scores
223
+ except Exception as e:
224
+ st.warning(f"Error calculating BM25 scores: {e}")
225
+ return np.zeros(len(resume_texts))
226
+
227
+ def faiss_recall(self, resume_texts, job_description, top_k=50):
228
+ """FAISS-based recall for top candidates"""
229
+ try:
230
+ if not self.embedding_model:
231
+ return list(range(min(top_k, len(resume_texts))))
232
+
233
+ # Get embeddings
234
+ resume_embeddings = np.array([self.get_embedding(text) for text in resume_texts])
235
+ job_embedding = self.get_embedding(job_description).reshape(1, -1)
236
+
237
+ # Build FAISS index
238
+ dimension = resume_embeddings.shape[1]
239
+ index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
240
+ index.add(resume_embeddings.astype('float32'))
241
+
242
+ # Search
243
+ scores, indices = index.search(job_embedding.astype('float32'), min(top_k, len(resume_texts)))
244
+
245
+ return indices[0].tolist()
246
+ except Exception as e:
247
+ st.warning(f"Error in FAISS recall: {e}")
248
+ return list(range(min(top_k, len(resume_texts))))
249
+
250
+ def cross_encoder_rerank(self, resume_texts, job_description, candidate_indices, top_k=20):
251
+ """Re-rank candidates using cross-encoder"""
252
+ try:
253
+ if not self.cross_encoder:
254
+ return candidate_indices[:top_k]
255
+
256
+ # Prepare pairs for cross-encoder
257
+ pairs = [(job_description, resume_texts[i]) for i in candidate_indices]
258
+
259
+ # Get scores
260
+ scores = self.cross_encoder.predict(pairs)
261
+
262
+ # Sort by scores and return top_k
263
+ scored_indices = list(zip(candidate_indices, scores))
264
+ scored_indices.sort(key=lambda x: x[1], reverse=True)
265
+
266
+ return [idx for idx, _ in scored_indices[:top_k]]
267
+ except Exception as e:
268
+ st.warning(f"Error in cross-encoder reranking: {e}")
269
+ return candidate_indices[:top_k]
270
+
271
+ def analyze_intent(self, resume_text, job_description):
272
+ """Analyze candidate intent using LLM"""
273
+ try:
274
+ if not self.llm_model or not self.llm_tokenizer:
275
+ return "Maybe", 0.5
276
+
277
+ prompt = f"""Analyze if this candidate is genuinely interested in this job based on their resume.
278
+
279
+ Job Description: {job_description[:500]}...
280
+
281
+ Resume: {resume_text[:1000]}...
282
+
283
+ Based on the alignment between the candidate's experience and the job requirements, classify their intent as:
284
+ - Yes: Strong alignment and genuine interest
285
+ - Maybe: Some alignment but unclear intent
286
+ - No: Poor alignment or likely not interested
287
+
288
+ Intent:"""
289
+
290
+ inputs = self.llm_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
291
+
292
+ with torch.no_grad():
293
+ outputs = self.llm_model.generate(
294
+ **inputs,
295
+ max_new_tokens=10,
296
+ temperature=0.1,
297
+ do_sample=True,
298
+ pad_token_id=self.llm_tokenizer.eos_token_id
299
+ )
300
+
301
+ response = self.llm_tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
302
+
303
+ # Parse response
304
+ if "yes" in response.lower():
305
+ return "Yes", 0.9
306
+ elif "no" in response.lower():
307
+ return "No", 0.1
308
+ else:
309
+ return "Maybe", 0.5
310
+
311
+ except Exception as e:
312
+ st.warning(f"Error in intent analysis: {e}")
313
+ return "Maybe", 0.5
314
+
315
+ def extract_skills(self, text, job_description):
316
+ """Extract matching skills from resume"""
317
+ text_lower = text.lower()
318
+ job_lower = job_description.lower()
319
+
320
+ # Find skills from predefined list
321
+ found_skills = []
322
+ for skill in self.skills_list:
323
+ if skill in text_lower:
324
+ found_skills.append(skill)
325
+
326
+ # Extract job-specific keywords (simple approach)
327
+ job_words = set(re.findall(r'\b[a-zA-Z]{3,}\b', job_lower))
328
+ text_words = set(re.findall(r'\b[a-zA-Z]{3,}\b', text_lower))
329
+ job_specific = list(job_words.intersection(text_words))[:10] # Top 10
330
+
331
+ return {
332
+ 'technical_skills': found_skills,
333
+ 'job_specific_keywords': job_specific,
334
+ 'total_skills': len(found_skills) + len(job_specific)
335
+ }
336
+
337
+ def add_bm25_scores(self, results_df, resume_texts, job_description):
338
+ """Add BM25 scores to results"""
339
+ bm25_scores = self.calculate_bm25_scores(resume_texts, job_description)
340
+ results_df['bm25_score'] = bm25_scores
341
+ return results_df
342
+
343
+ def add_intent_scores(self, results_df, resume_texts, job_description):
344
+ """Add intent analysis scores"""
345
+ intent_labels = []
346
+ intent_scores = []
347
+
348
+ progress_bar = st.progress(0)
349
+ for i, text in enumerate(resume_texts):
350
+ label, score = self.analyze_intent(text, job_description)
351
+ intent_labels.append(label)
352
+ intent_scores.append(score)
353
+ progress_bar.progress((i + 1) / len(resume_texts))
354
+
355
+ results_df['intent_label'] = intent_labels
356
+ results_df['intent_score'] = intent_scores
357
+ return results_df
358
+
359
+ def calculate_final_scores(self, results_df):
360
+ """Calculate final weighted scores"""
361
+ # Normalize scores to 0-1 range
362
+ if 'cross_encoder_score' in results_df.columns:
363
+ ce_scores = (results_df['cross_encoder_score'] - results_df['cross_encoder_score'].min()) / \
364
+ (results_df['cross_encoder_score'].max() - results_df['cross_encoder_score'].min() + 1e-8)
365
+ else:
366
+ ce_scores = np.zeros(len(results_df))
367
+
368
+ if 'bm25_score' in results_df.columns:
369
+ bm25_scores = (results_df['bm25_score'] - results_df['bm25_score'].min()) / \
370
+ (results_df['bm25_score'].max() - results_df['bm25_score'].min() + 1e-8)
371
+ else:
372
+ bm25_scores = np.zeros(len(results_df))
373
+
374
+ intent_scores = results_df.get('intent_score', np.ones(len(results_df)) * 0.5)
375
+
376
+ # Weighted combination
377
+ final_scores = 0.5 * ce_scores + 0.3 * bm25_scores + 0.2 * intent_scores
378
+ results_df['final_score'] = final_scores
379
+
380
+ results_df = results_df.sort_values('final_score', ascending=False)
+ results_df['rank'] = range(1, len(results_df) + 1) # refresh ranks so they match the final ordering (index labels are kept so the later skills concat still aligns)
+ return results_df
381
+
382
+ def advanced_pipeline_ranking(self, resume_texts, resume_filenames, job_description):
383
+ """Run the complete advanced pipeline"""
384
+ st.info("🚀 Starting advanced pipeline ranking...")
385
+
386
+ # Stage 1: FAISS Recall
387
+ st.text("Stage 1: FAISS-based recall (top 50 candidates)")
388
+ top_50_indices = self.faiss_recall(resume_texts, job_description, top_k=50)
389
+
390
+ # Stage 2: Cross-encoder reranking
391
+ st.text("Stage 2: Cross-encoder reranking (top 20 candidates)")
392
+ top_20_indices = self.cross_encoder_rerank(resume_texts, job_description, top_50_indices, top_k=20)
393
+
394
+ # Create results dataframe
395
+ results_df = pd.DataFrame({
396
+ 'rank': range(1, len(top_20_indices) + 1),
397
+ 'filename': [resume_filenames[i] for i in top_20_indices],
398
+ 'resume_index': top_20_indices
399
+ })
400
+
401
+ # Stage 3: Add cross-encoder scores
402
+ st.text("Stage 3: Adding detailed cross-encoder scores")
403
+ if self.cross_encoder:
404
+ pairs = [(job_description, resume_texts[i]) for i in top_20_indices]
405
+ ce_scores = self.cross_encoder.predict(pairs)
406
+ results_df['cross_encoder_score'] = ce_scores
407
+
408
+ # Stage 4: Add BM25 scores
409
+ st.text("Stage 4: Adding BM25 scores")
410
+ top_20_texts = [resume_texts[i] for i in top_20_indices]
411
+ results_df = self.add_bm25_scores(results_df, top_20_texts, job_description)
412
+
413
+ # Stage 5: Add intent analysis
414
+ st.text("Stage 5: Analyzing candidate intent")
415
+ results_df = self.add_intent_scores(results_df, top_20_texts, job_description)
416
+
417
+ # Calculate final scores
418
+ st.text("Calculating final weighted scores...")
419
+ results_df = self.calculate_final_scores(results_df)
420
+
421
+ # Add skills analysis
422
+ st.text("Extracting skills and keywords...")
423
+ skills_data = []
424
+ for i in top_20_indices:
425
+ skills = self.extract_skills(resume_texts[i], job_description)
426
+ skills_data.append({
427
+ 'top_skills': ', '.join(skills['technical_skills'][:5]),
428
+ 'job_keywords': ', '.join(skills['job_specific_keywords'][:5]),
429
+ 'total_skills_count': skills['total_skills']
430
+ })
431
+
432
+ skills_df = pd.DataFrame(skills_data)
433
+ results_df = pd.concat([results_df, skills_df], axis=1)
434
+
435
+ st.success("✅ Pipeline completed successfully!")
436
+ return results_df
437
+
438
+ # Load models on startup
439
+ if ML_IMPORTS_AVAILABLE and not st.session_state.models_loaded:
440
+ load_models()
441
+
442
+ # Initialize screener
443
+ if ML_IMPORTS_AVAILABLE and st.session_state.models_loaded:
444
+ screener = ResumeScreener()
445
+
446
+ # Sidebar
447
+ with st.sidebar:
448
+ st.title("🤖 AI Resume Screener")
449
+ st.markdown("---")
450
+
451
+ st.subheader("📋 Pipeline Stages")
452
+ st.markdown("""
453
+ 1. **FAISS Recall**: Semantic similarity search (top 50)
454
+ 2. **Cross-Encoder**: Deep reranking (top 20)
455
+ 3. **BM25 Scoring**: Keyword-based relevance
456
+ 4. **Intent Analysis**: AI-powered candidate intent
457
+ 5. **Final Ranking**: Weighted score combination
458
+ """)
459
+
460
+ st.subheader("🧠 AI Models")
461
+ if st.session_state.models_loaded:
462
+ st.success("✅ Embedding: BGE-Large-EN")
463
+ st.success("✅ Cross-Encoder: MS-Marco-MiniLM")
464
+ st.success("✅ LLM: Qwen2-1.5B")
465
+ else:
466
+ st.warning("⏳ Models loading...")
467
+
468
+ if st.session_state.model_errors:
469
+ st.error("❌ Model Errors:")
470
+ for model, error in st.session_state.model_errors.items():
471
+ st.text(f"{model}: {error[:100]}...")
472
+
473
+ st.subheader("📊 Scoring Formula")
474
+ st.markdown("""
475
+ **Final Score = 0.5 × Cross-Encoder + 0.3 × BM25 + 0.2 × Intent**
476
+
477
+ - Cross-Encoder: Deep semantic matching
478
+ - BM25: Keyword relevance
479
+ - Intent: Candidate interest level
480
+ """)
481
+
482
+ # Main content
483
+ st.title("🤖 AI Resume Screener")
484
+ st.markdown("Automatically rank candidate resumes against job descriptions using advanced AI")
485
+
486
+ # Step 1: Job Description Input
487
+ st.header("📝 Step 1: Job Description")
488
+ job_description = st.text_area(
489
+ "Enter the job description:",
490
+ height=200,
491
+ placeholder="Paste the complete job description here..."
492
+ )
493
+
494
+ # Step 2: Resume Upload
495
+ st.header("📄 Step 2: Load Resumes")
496
+
497
+ upload_option = st.radio(
498
+ "Choose how to load resumes:",
499
+ ["Upload Files", "Upload CSV", "Load from Hugging Face Dataset"]
500
+ )
501
+
502
+ if upload_option == "Upload Files":
503
+ uploaded_files = st.file_uploader(
504
+ "Upload resume files",
505
+ type=['pdf', 'docx', 'txt'],
506
+ accept_multiple_files=True
507
+ )
508
+
509
+ if uploaded_files and st.button("Process Uploaded Files"):
510
+ with st.spinner("Processing files..."):
511
+ texts = []
512
+ filenames = []
513
+
514
+ for file in uploaded_files:
515
+ if ML_IMPORTS_AVAILABLE and st.session_state.models_loaded:
516
+ text = screener.extract_text_from_file(file)
517
+ if text:
518
+ texts.append(text)
519
+ filenames.append(file.name)
520
+ else:
521
+ st.error("Models not loaded. Cannot process files.")
522
+ break
523
+
524
+ st.session_state.resume_texts = texts
525
+ st.session_state.resume_filenames = filenames
526
+ st.success(f"✅ Processed {len(texts)} resumes")
527
+
528
+ elif upload_option == "Upload CSV":
529
+ csv_file = st.file_uploader("Upload CSV with resume texts", type=['csv'])
530
+
531
+ if csv_file:
532
+ df = pd.read_csv(csv_file)
533
+ st.write("CSV Preview:", df.head())
534
+
535
+ text_column = st.selectbox("Select text column:", df.columns)
536
+ name_column = st.selectbox("Select name/ID column:", df.columns)
537
+
538
+ if st.button("Load from CSV"):
539
+ st.session_state.resume_texts = df[text_column].fillna("").tolist()
540
+ st.session_state.resume_filenames = df[name_column].fillna("Unknown").tolist()
541
+ st.success(f"✅ Loaded {len(st.session_state.resume_texts)} resumes from CSV")
542
+
543
+ elif upload_option == "Load from Hugging Face Dataset":
544
+ dataset_name = st.text_input("Dataset name:", "resume-dataset/resume-screening")
545
+
546
+ if st.button("Load Dataset"):
547
+ try:
548
+ with st.spinner("Loading dataset..."):
549
+ dataset = load_dataset(dataset_name, split="train")
550
+
551
+ # Try to identify text and name columns
552
+ columns = dataset.column_names
553
+ text_col = st.selectbox("Select text column:", columns)
554
+ name_col = st.selectbox("Select name/ID column:", columns)
555
+
556
+ if text_col and name_col:
557
+ st.session_state.resume_texts = dataset[text_col][:100] # Limit to 100
558
+ st.session_state.resume_filenames = [f"Resume_{i}" for i in range(len(st.session_state.resume_texts))]
559
+ st.success(f"✅ Loaded {len(st.session_state.resume_texts)} resumes from dataset")
560
+ except Exception as e:
561
+ st.error(f"Error loading dataset: {e}")
562
+
563
+ # Display current resume count
564
+ if st.session_state.resume_texts:
565
+ st.info(f"📊 Currently loaded: {len(st.session_state.resume_texts)} resumes")
566
+
567
+ # Step 3: Run Pipeline
568
+ st.header("🚀 Step 3: Run Advanced Pipeline")
569
+
570
+ can_run = (
571
+ ML_IMPORTS_AVAILABLE and
572
+ st.session_state.models_loaded and
573
+ job_description.strip() and
574
+ st.session_state.resume_texts
575
+ )
576
+
577
+ if st.button("🎯 Run Advanced Ranking Pipeline", disabled=not can_run):
578
+ if not can_run:
579
+ if not ML_IMPORTS_AVAILABLE:
580
+ st.error("❌ ML libraries not available")
581
+ elif not st.session_state.models_loaded:
582
+ st.error("❌ Models not loaded")
583
+ elif not job_description.strip():
584
+ st.error("❌ Please enter a job description")
585
+ elif not st.session_state.resume_texts:
586
+ st.error("❌ Please load some resumes")
587
+ else:
588
+ with st.spinner("Running advanced pipeline..."):
589
+ results = screener.advanced_pipeline_ranking(
590
+ st.session_state.resume_texts,
591
+ st.session_state.resume_filenames,
592
+ job_description
593
+ )
594
+ st.session_state.results = results
595
+
596
+ # Display Results
597
+ if st.session_state.results is not None:
598
+ st.header("📊 Results")
599
+
600
+ # Create tabs for different views
601
+ tab1, tab2, tab3 = st.tabs(["📋 Summary", "🔍 Detailed Analysis", "📈 Visualizations"])
602
+
603
+ with tab1:
604
+ st.subheader("Top Ranked Candidates")
605
+
606
+ # Style the dataframe
607
+ display_df = st.session_state.results[['rank', 'filename', 'final_score', 'cross_encoder_score',
608
+ 'bm25_score', 'intent_score', 'intent_label', 'top_skills']].copy()
609
+ display_df['final_score'] = display_df['final_score'].round(3)
610
+ display_df['cross_encoder_score'] = display_df['cross_encoder_score'].round(3)
611
+ display_df['bm25_score'] = display_df['bm25_score'].round(3)
612
+ display_df['intent_score'] = display_df['intent_score'].round(3)
613
+
614
+ st.dataframe(display_df, use_container_width=True)
615
+
616
+ # Download link
617
+ csv = display_df.to_csv(index=False)
618
+ b64 = base64.b64encode(csv.encode()).decode()
619
+ href = f'<a href="data:file/csv;base64,{b64}" download="resume_rankings.csv">📥 Download Results as CSV</a>'
620
+ st.markdown(href, unsafe_allow_html=True)
621
+
622
+ with tab2:
623
+ st.subheader("Detailed Candidate Analysis")
624
+
625
+ for idx, row in st.session_state.results.iterrows():
626
+ with st.expander(f"#{row['rank']} - {row['filename']} (Score: {row['final_score']:.3f})"):
627
+ col1, col2 = st.columns(2)
628
+
629
+ with col1:
630
+ st.metric("Final Score", f"{row['final_score']:.3f}")
631
+ st.metric("Cross-Encoder", f"{row['cross_encoder_score']:.3f}")
632
+ st.metric("BM25 Score", f"{row['bm25_score']:.3f}")
633
+
634
+ with col2:
635
+ st.metric("Intent Score", f"{row['intent_score']:.3f}")
636
+ st.metric("Intent Label", row['intent_label'])
637
+ st.metric("Skills Count", row['total_skills_count'])
638
+
639
+ st.write("**Top Skills:**", row['top_skills'])
640
+ st.write("**Job Keywords:**", row['job_keywords'])
641
+
642
+ # Show resume excerpt
643
+ resume_text = st.session_state.resume_texts[row['resume_index']]
644
+ st.text_area("Resume Excerpt:", resume_text[:500] + "...", height=100, key=f"excerpt_{idx}")
645
+
646
+ with tab3:
647
+ st.subheader("Score Visualizations")
648
+
649
+ # Score distribution
650
+ fig1 = px.bar(
651
+ st.session_state.results.head(10),
652
+ x='filename',
653
+ y='final_score',
654
+ title="Top 10 Candidates - Final Scores",
655
+ color='final_score',
656
+ color_continuous_scale='viridis'
657
+ )
658
+ fig1.update_xaxes(tickangle=45)
659
+ st.plotly_chart(fig1, use_container_width=True)
660
+
661
+ # Score breakdown
662
+ score_cols = ['cross_encoder_score', 'bm25_score', 'intent_score']
663
+ fig2 = go.Figure()
664
+
665
+ for col in score_cols:
666
+ fig2.add_trace(go.Bar(
667
+ name=col.replace('_', ' ').title(),
668
+ x=st.session_state.results['filename'].head(10),
669
+ y=st.session_state.results[col].head(10)
670
+ ))
671
+
672
+ fig2.update_layout(
673
+ title="Score Breakdown - Top 10 Candidates",
674
+ barmode='group',
675
+ xaxis_tickangle=45
676
+ )
677
+ st.plotly_chart(fig2, use_container_width=True)
678
+
679
+ # Intent distribution
680
+ intent_counts = st.session_state.results['intent_label'].value_counts()
681
+ fig3 = px.pie(
682
+ values=intent_counts.values,
683
+ names=intent_counts.index,
684
+ title="Candidate Intent Distribution"
685
+ )
686
+ st.plotly_chart(fig3, use_container_width=True)
687
+
688
+ # Average metrics
689
+ col1, col2, col3, col4 = st.columns(4)
690
+ with col1:
691
+ st.metric("Avg Final Score", f"{st.session_state.results['final_score'].mean():.3f}")
692
+ with col2:
693
+ st.metric("Avg Cross-Encoder", f"{st.session_state.results['cross_encoder_score'].mean():.3f}")
694
+ with col3:
695
+ st.metric("Avg BM25", f"{st.session_state.results['bm25_score'].mean():.3f}")
696
+ with col4:
697
+ st.metric("Avg Intent", f"{st.session_state.results['intent_score'].mean():.3f}")
698
+
699
+ # Cleanup Controls
700
+ st.header("🧹 Cleanup")
701
+ col1, col2 = st.columns(2)
702
+
703
+ with col1:
704
+ if st.button("Clear Resumes Only"):
705
+ st.session_state.resume_texts = []
706
+ st.session_state.resume_filenames = []
707
+ st.session_state.results = None
708
+ st.success("✅ Resumes cleared")
709
+
710
+ with col2:
711
+ if st.button("Reset Entire App"):
712
+ # Clear all session state
713
+ for key in list(st.session_state.keys()):
714
+ del st.session_state[key]
715
+
716
+ # Free GPU memory
717
+ if torch.cuda.is_available():
718
+ torch.cuda.empty_cache()
719
+
720
+ st.success("✅ App reset complete")
721
+ st.rerun()
722
 
723
+ # Footer
724
+ st.markdown("---")
725
+ st.markdown(
726
+ """
727
+ <div style='text-align: center; color: #666; font-size: 0.8em;'>
728
+ 🤖 Powered by BGE-Large-EN, MS-Marco-MiniLM, Qwen2-1.5B | Built with Streamlit
729
+ </div>
730
+ """,
731
+ unsafe_allow_html=True
732
+ )
test_installation.py ADDED
@@ -0,0 +1,99 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script to verify AI Resume Screener installation
4
+ """
5
+
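+ # Run with: python test_installation.py
+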
6
+ import sys
7
+ import importlib
8
+
9
+ def test_import(module_name, package_name=None):
10
+ """Test if a module can be imported"""
11
+ try:
12
+ importlib.import_module(module_name)
13
+ print(f"✅ {package_name or module_name}")
14
+ return True
15
+ except ImportError as e:
16
+ print(f"❌ {package_name or module_name}: {e}")
17
+ return False
18
+
19
+ def main():
20
+ print("🧪 Testing AI Resume Screener Installation\n")
21
+
22
+ # Core dependencies
23
+ print("📦 Core Dependencies:")
24
+ core_deps = [
25
+ ("streamlit", "Streamlit"),
26
+ ("pandas", "Pandas"),
27
+ ("numpy", "NumPy"),
28
+ ("plotly", "Plotly"),
29
+ ]
30
+
31
+ # Use a list comprehension so every import is attempted; all() over a generator stops at the first failure
+ core_success = all([test_import(module, name) for module, name in core_deps])
32
+
33
+ # ML/AI dependencies
34
+ print("\n🤖 ML/AI Dependencies:")
35
+ ml_deps = [
36
+ ("sentence_transformers", "Sentence Transformers"),
37
+ ("transformers", "Transformers"),
38
+ ("torch", "PyTorch"),
39
+ ("faiss", "FAISS"),
40
+ ("rank_bm25", "Rank BM25"),
41
+ ("nltk", "NLTK"),
42
+ ]
43
+
44
+ ml_success = all([test_import(module, name) for module, name in ml_deps])
45
+
46
+ # File processing dependencies
47
+ print("\n📄 File Processing Dependencies:")
48
+ file_deps = [
49
+ ("pdfplumber", "PDF Plumber"),
50
+ ("PyPDF2", "PyPDF2"),
51
+ ("docx", "python-docx"),
52
+ ("datasets", "Hugging Face Datasets"),
53
+ ]
54
+
55
+ file_success = all([test_import(module, name) for module, name in file_deps])
56
+
57
+ # Optional dependencies
58
+ print("\n⚡ Optional Dependencies:")
59
+ optional_deps = [
60
+ ("accelerate", "Accelerate"),
61
+ ("bitsandbytes", "BitsAndBytes"),
62
+ ]
63
+
64
+ for module, name in optional_deps:
65
+ test_import(module, name)
66
+
67
+ # Summary
68
+ print("\n" + "="*50)
69
+ if core_success and ml_success and file_success:
70
+ print("🎉 All required dependencies are installed!")
71
+ print("✅ Ready to run AI Resume Screener")
72
+
73
+ # Test basic functionality
74
+ print("\n🔧 Testing basic functionality...")
75
+ try:
76
+ import pandas as pd
77
+ import numpy as np
78
+ from sentence_transformers import SentenceTransformer
79
+
80
+ # Test data creation
81
+ test_df = pd.DataFrame({'test': [1, 2, 3]})
82
+ test_array = np.array([1, 2, 3])
83
+
84
+ print("✅ Pandas and NumPy working")
85
+ print("✅ Installation test completed successfully!")
86
+
87
+ except Exception as e:
88
+ print(f"❌ Basic functionality test failed: {e}")
89
+
90
+ else:
91
+ print("❌ Some required dependencies are missing")
92
+ print("📝 Please install missing packages using:")
93
+ print(" pip install -r requirements.txt")
94
+
95
+ print("\n🚀 To run the application:")
96
+ print(" streamlit run src/streamlit_app.py")
97
+
98
+ if __name__ == "__main__":
99
+ main()