File Structure
Chatbot Evaluation System - File Structure and Organization
Project Root Structure
chatbot_eval/
├── 📁 chatbot architecture/ # This documentation folder
│ ├── business_overview.md # Business context and value proposition
│ ├── system_architecture.md # High-level system design
│ ├── extractive_qa_models.md # Extractive QA model details
│ ├── generative_chat_models.md # API and generative model details
│ ├── local_chat_models.md # Local model implementation
│ ├── technical_implementation.md # Code architecture and patterns
│ ├── file_structure.md # This file - project organization
│ └── deployment_guide.md # Deployment and operations
├── 📁 templates/ # HTML templates for web UI
│ ├── index.html # Main evaluation interface
│ └── test.html # Testing and debugging interface
├── 📁 static/ # Frontend assets
│ ├── style.css # Main stylesheet
│ └── script.js # Frontend JavaScript logic
├── 📁 models/ # Model implementations and providers
│ ├── __init__.py # Package initialization
│ ├── qa.py # Model registry and provider wrappers
│ └── loader.py # Model loading utilities
├── 📁 metrics/ # Evaluation metrics implementations
│ ├── __init__.py # Package initialization
│ ├── ngram.py # BLEU, ROUGE-L, METEOR metrics
│ ├── semsim.py # Semantic similarity computation
│ ├── diversity.py # Text diversity metrics
│ ├── bert_score.py # BERT-based scoring
│ ├── bleurt.py # BLEURT evaluation metric
│ ├── comet.py # COMET metric implementation
│ ├── toxicity.py # Toxicity detection
│ └── latency.py # Performance measurement
├── 📁 utils/ # Utility functions and helpers
│ ├── __init__.py # Package initialization
│ ├── config.py # Configuration management
│ └── aggregator.py # Results aggregation utilities
├── 📁 tests/ # Test files and validation
│ ├── test_metrics.py # Metrics testing
│ ├── test_openrouter.py # OpenRouter integration tests
│ └── test_runner.py # Core evaluation tests
├── 📁 examples/ # Example data and configurations
│ └── sample.json # Sample evaluation data
├── 📁 __pycache__/ # Python bytecode cache
├── app.py # Main Flask application
├── runner.py # Core evaluation engine
├── cli.py # Command-line interface
├── test.py # Testing utilities
├── verify_install.py # Installation verification
├── requirements.txt # Python dependencies
├── config.json # Runtime configuration
├── config.example.json # Configuration template
├── README.md # Project documentation
└── summary.md # Comprehensive project summary
Detailed File Organization
1. Core Application Files
app.py - Main Flask Application
Purpose: Web server setup, route handling, and application lifecycle management
Key Components:
- Flask application configuration and middleware
- Route handlers for /, /evaluate, /health, /health/models
- Model preloading and cache warming
- Error handling and logging setup
Code Organization:
# Section 1: Imports and Setup
# Section 2: Flask Application Configuration
# Section 3: Route Handlers (/, /evaluate, /health, /health/models)
# Section 4: Utility Functions (caching, warming, etc.)
# Section 5: Main Execution Block
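The route layout above can be pictured with a minimal Flask sketch. The handler bodies, helper names, and the delegation to runner.py shown here are illustrative assumptions, not the shipped implementation:

```python
# Minimal sketch of the route layout described above (handler bodies are assumptions).
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Serves the main evaluation interface from templates/index.html
    return render_template("index.html")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    payload = request.get_json(force=True)
    # results = generate_and_evaluate(**payload)  # delegated to runner.py
    return jsonify({"status": "ok", "received": payload})

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

@app.route("/health/models")
def health_models():
    # Would report which models are preloaded and reachable
    return jsonify({"models": []})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```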
runner.py - Evaluation Engine
Purpose: Core logic for chatbot evaluation, model inference, and metrics computation
Key Components:
- Question processing and small talk detection
- Context augmentation with knowledge bases
- Model selection and inference orchestration
- Response post-processing and sanitization
- Metrics computation coordination
- Fallback mechanisms for reliability
Code Organization:
# Section 1: Imports and Constants
# Section 2: Utility Functions (sanitization, normalization)
# Section 3: Core Evaluation Function (generate_and_evaluate)
# Section 4: Small Talk and Context Management
# Section 5: Model Inference Pipeline
# Section 6: Fallback Mechanisms
# Section 7: Metrics Computation
# Section 8: Batch Evaluation Function
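A hypothetical signature for the core entry point shows how these sections fit together; the parameter names and default values below are assumptions and may differ from the actual runner.py:

```python
# Hypothetical outline of the evaluation pipeline; names are illustrative only.
from typing import Any, Dict, List, Optional

def generate_and_evaluate(
    question: str,
    references: Optional[List[str]] = None,
    model_id: str = "some-supported-model",
    options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # 1. Detect small talk and short-circuit with a canned reply if appropriate.
    # 2. Augment the question with knowledge-base context.
    # 3. Run inference through the selected provider wrapper (models/qa.py).
    # 4. Sanitize and normalize the raw response.
    # 5. Compute the configured metrics against the references.
    # 6. Fall back to an alternative model or default answer on failure.
    return {"answer": "", "metrics": {}}
```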
2. Model Layer (models/)
models/qa.py - Model Registry and Providers
Purpose: Unified interface for all model providers and model management
Key Components:
- Model registry (SUPPORTED_MODELS)
- Provider-specific wrapper functions
- Model preloading and caching logic
- API client initialization and management
Code Organization:
# Section 1: Imports and Model Registry
# Section 2: Provider Wrapper Functions
# ├── OpenAI Wrapper (_build_openai_wrapper)
# ├── Gemini Wrapper (_build_gemini_wrapper)
# ├── OpenRouter Wrapper (_build_openrouter_wrapper)
# ├── Hugging Face QA Wrapper (_build_hf_qa_wrapper)
# └── Local Chat Wrapper (_build_hf_chat_wrapper)
# Section 3: Model Preloading (preload_qa_models)
# Section 4: Pipeline Retrieval (get_qa_pipeline)
Model Categories:
1. Extractive QA Models: Direct answer extraction from context
2. API-based Chat Models: External generative models via APIs
3. Local Chat Models: Self-hosted generative models
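The provider wrappers listed above all return a callable with a common signature. The registry entries and wrapper body below are an illustrative sketch under that assumption, not the definitions shipped in models/qa.py:

```python
# Illustrative provider-wrapper pattern; registry entries and body are assumptions.
SUPPORTED_MODELS = {
    "gpt-4o-mini": {"provider": "openai", "type": "chat"},
    "deepset/roberta-base-squad2": {"provider": "huggingface", "type": "extractive_qa"},
}

def _build_openai_wrapper(model_name: str):
    """Return a callable(question, context) -> answer backed by the OpenAI API."""
    from openai import OpenAI  # openai>=1.35.0 per requirements.txt
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question: str, context: str = "") -> str:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "Answer using the provided context."},
                {"role": "user", "content": f"{context}\n\n{question}".strip()},
            ],
        )
        return response.choices[0].message.content or ""

    return ask
```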
models/loader.py - Model Loading Utilities
Purpose: Efficient model loading and device management
Key Components:
- Lazy loading implementation for memory efficiency
- Device selection and management (CPU/GPU)
- Model caching and memory optimization
Key Functions:
- load_transformer_model_for_generation(): Load generative models
- set_device_policy(): Configure compute device preferences
- get_device(): Get current compute device
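A minimal sketch of these three functions, assuming a module-level device override and an lru_cache for model reuse; the real caching strategy in loader.py may differ:

```python
# Sketch of lazy loading with a device policy; the lru_cache here is an assumption.
from functools import lru_cache
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_DEVICE_OVERRIDE: Optional[str] = None

def set_device_policy(device: Optional[str]) -> None:
    """Force "cpu"/"cuda", or pass None to auto-detect."""
    global _DEVICE_OVERRIDE
    _DEVICE_OVERRIDE = device

def get_device() -> str:
    if _DEVICE_OVERRIDE:
        return _DEVICE_OVERRIDE
    return "cuda" if torch.cuda.is_available() else "cpu"

@lru_cache(maxsize=4)
def load_transformer_model_for_generation(model_name: str):
    """Lazily load and cache a generative model on the selected device."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(get_device())
    return tokenizer, model
```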
3. Metrics Layer (metrics/)
Metrics Package Structure
Each metric module follows a consistent pattern:
from typing import Dict, List

def compute(predictions: List[str], references: List[List[str]], options: Dict) -> Dict:
    """
    Compute metric scores for given predictions and references.

    Args:
        predictions: List of predicted responses
        references: List of reference response lists
        options: Computation options (device, model_type, etc.)

    Returns:
        Dict with 'aggregate' scores and 'per_item' breakdowns
    """
metrics/ngram.py - BLEU/ROUGE/METEOR
Purpose: Traditional n-gram based evaluation metrics
Key Metrics:
- BLEU (Bilingual Evaluation Understudy)
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)
- METEOR (Metric for Evaluation of Translation with Explicit Ordering)
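These scores could be obtained with the evaluate package pinned in requirements.txt; the snippet below is a hedged sketch of that approach, and the shipped ngram.py may wrap the loaders differently or call sacrebleu/rouge-score directly:

```python
# Sketch using the `evaluate` package (assumed loader names: sacrebleu, rouge, meteor).
import evaluate

predictions = ["Rest and drink plenty of fluids."]
references = [["Rest and plenty of fluids are recommended."]]

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=[r[0] for r in references])["rougeL"])
print(meteor.compute(predictions=predictions, references=[r[0] for r in references])["meteor"])
```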
metrics/semsim.py - Semantic Similarity
Purpose: Deep learning based semantic evaluation
Implementation:
- Sentence Transformers for embedding computation
- Cosine similarity for semantic matching
- Contextual understanding of response quality
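A minimal sketch of the embedding-plus-cosine-similarity approach; the checkpoint name is an assumption, and semsim.py may use a different Sentence Transformers model:

```python
# Encode prediction and reference, then compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

prediction = "Drink plenty of water and rest."
reference = "Staying hydrated and resting is advised."

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine in [-1, 1]
print(round(similarity, 3))
```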
metrics/bert_score.py - BERT-based Scoring
Purpose: Advanced semantic evaluation using BERT
Features:
- Contextual token-level similarity
- Precision, recall, and F1 scoring
- Support for multiple BERT model variants
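A hedged sketch using the bert-score package from requirements.txt; the model variant and rescaling options chosen in bert_score.py may differ:

```python
# BERTScore returns per-pair precision, recall, and F1 tensors.
from bert_score import score

preds = ["Take the medication twice daily."]
refs = ["The medicine should be taken two times a day."]

precision, recall, f1 = score(preds, refs, lang="en")
print(f1.mean().item())
```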
metrics/bleurt.py - BLEURT Metric
Purpose: Neural evaluation metric trained on human judgments
Features:
- Learned evaluation criteria
- Human-like quality assessment
- Requires a large checkpoint (optional dependency)
metrics/comet.py - COMET Evaluation
Purpose: Reference-free evaluation metric
Features:
- Quality estimation without references
- Multi-dimensional scoring
- Cross-lingual capabilities
metrics/toxicity.py - Toxicity Detection
Purpose: Safety and content quality assessment
Implementation:
- Perspective API integration
- Toxicity probability scoring
- Configurable thresholds
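An illustrative Perspective API call; the endpoint and payload follow Google's public API documentation, while the key handling (PERSPECTIVE_API_KEY) and threshold logic here are assumptions:

```python
# Query Perspective API for a TOXICITY probability in [0, 1].
import os

import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    response = requests.post(
        PERSPECTIVE_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},  # assumed env var name
        json={"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```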
metrics/latency.py - Performance Metrics
Purpose: Response time and throughput measurement
Metrics:
- End-to-end latency
- Model inference time
- Metrics computation time
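A simple timing pattern of the kind latency.py is likely to use (an assumption, not the shipped helper):

```python
# Wrap any callable and report its wall-clock duration.
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```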
4. Utilities Layer (utils/)
utils/config.py - Configuration Management
Purpose: Centralized configuration and API key management
Key Functions:
- get_api_key(): Retrieve API keys from environment or config files
- get_openrouter_headers(): Generate OpenRouter-specific headers
- validate_configuration(): Validate system configuration
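A hedged sketch of the lookup order (environment variables first, then config.json); the key naming and validation in the real config.py may be richer:

```python
# Resolve an API key from the environment, falling back to config.json.
import json
import os
from typing import Optional

def get_api_key(name: str, config_path: str = "config.json") -> Optional[str]:
    env_value = os.environ.get(name.upper())
    if env_value:
        return env_value
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as handle:
            return json.load(handle).get(name.lower())
    return None

# Example: get_api_key("openai_api_key") checks OPENAI_API_KEY, then config.json.
```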
utils/aggregator.py - Results Aggregation
Purpose: Combine and format evaluation results
Key Functions:
- aggregate_metrics(): Combine multiple metric results
- format_results(): Prepare results for API response
- calculate_confidence_intervals(): Statistical analysis of results
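A sketch of how per-metric results could be merged into one payload; the real aggregator.py may also attach confidence intervals and response formatting:

```python
# Combine {metric_name: {"aggregate": ..., "per_item": ...}} dicts into one result.
from typing import Dict

def aggregate_metrics(results: Dict[str, Dict]) -> Dict:
    return {
        "aggregate": {name: r.get("aggregate") for name, r in results.items()},
        "per_item": {name: r.get("per_item") for name, r in results.items()},
    }
```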
5. Frontend Layer (templates/, static/)
templates/index.html - Main Interface
Purpose: User interface for chatbot evaluation
Components:
- Model selection dropdown
- Question input with hints
- Advanced hyperparameter configuration
- Results display area
- Provider debugging interface
static/script.js - Frontend Logic
Purpose: Client-side evaluation and UI management
Key Functions:
- Form submission and API communication
- Results rendering and visualization
- Model metadata display
- Error handling and user feedback
static/style.css - Styling
Purpose: Modern, responsive UI design
Features:
- Clean, professional appearance
- Mobile-responsive design
- Dark/light theme support
- Accessible color schemes
6. Configuration and Infrastructure
config.json - Runtime Configuration
{
  "openai_api_key": "sk-...",
  "gemini_api_key": "AIza...",
  "openrouter_api_key": "sk-or-...",
  "openrouter_site_url": "https://your-site.example",
  "openrouter_site_name": "Your App Name"
}
requirements.txt - Dependencies
Core Dependencies:
- flask>=3.0.0: Web framework
- transformers>=4.42.0: Hugging Face models
- torch>=2.1.0: Deep learning framework
- openai>=1.35.0: OpenAI API client
- google-generativeai>=0.7.0: Gemini API client
Evaluation Dependencies:
- evaluate>=0.4.1: Metrics computation framework
- sacrebleu>=2.4.0: BLEU metric implementation
- rouge-score>=0.1.2: ROUGE metric implementation
- bert-score>=0.3.13: BERTScore metric
- sentence-transformers>=2.2.2: Semantic similarity
7. Testing and Validation (tests/, examples/)
Test Files:
- test_metrics.py: Validate metric computations
- test_openrouter.py: Test OpenRouter integration
- test_runner.py: Test core evaluation functionality
Example Files:
- sample.json: Sample evaluation data format
- config.example.json: Configuration template
8. Documentation and Deployment
README.md - Project Documentation
Sections:
- Feature overview and installation
- Configuration and API key setup
- Usage examples and troubleshooting
- Model support and performance notes
verify_install.py - Installation Verification
Purpose: Validate system setup and functionality
Tests:
- Model loading verification
- API connectivity testing
- Metrics computation validation
- End-to-end evaluation testing
cli.py - Command Line Interface
Purpose: Batch evaluation and scripting support
Features:
- File-based evaluation
- CSV export functionality
- Batch processing capabilities
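A hypothetical CLI shape illustrating these features; the flag names and delegated helper are assumptions, not the shipped interface:

```python
# Sketch of a batch-evaluation CLI; flag names are illustrative.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Batch chatbot evaluation")
    parser.add_argument("--input", required=True, help="JSON file of questions and references")
    parser.add_argument("--model", default="gpt-4o-mini", help="Model id from SUPPORTED_MODELS")
    parser.add_argument("--csv", help="Optional path for CSV export of results")
    args = parser.parse_args()
    # evaluate_batch(args.input, args.model, csv_path=args.csv)  # delegated to runner.py

if __name__ == "__main__":
    main()
```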
Development Workflow
Code Organization Principles
- Separation of Concerns: Each module has a single responsibility
- Consistent Interfaces: All model providers implement the same interface
- Error Handling: Comprehensive error handling with graceful degradation
- Configuration Management: Environment-based configuration with validation
- Performance Optimization: Lazy loading and caching for efficiency
File Naming Conventions
- Python modules: snake_case.py (e.g., qa.py, semsim.py)
- Classes: PascalCase (e.g., OpenAIProvider, MetricsEngine)
- Functions: snake_case (e.g., compute_metrics, load_model)
- Constants: UPPER_CASE (e.g., SUPPORTED_MODELS, MEDICAL_SYSTEM_PROMPT)
Import Structure
Clean Import Organization:
# Standard library imports first
import os
import logging
from typing import Dict, List, Any
# Third-party imports
from flask import Flask, jsonify
from transformers import pipeline
# Local imports last
from models.qa import SUPPORTED_MODELS
from utils.config import get_api_key
Deployment Organization
Production Deployment Structure
deployment/
├── docker/
│ ├── Dockerfile # Main application container
│ ├── docker-compose.yml # Multi-service orchestration
│ └── nginx.conf # Reverse proxy configuration
├── kubernetes/
│ ├── deployment.yml # Kubernetes deployment
│ ├── service.yml # Service definition
│ └── ingress.yml # Ingress configuration
├── monitoring/
│ ├── prometheus.yml # Metrics collection
│ └── grafana/ # Dashboards
└── scripts/
├── deploy.sh # Deployment automation
└── health_check.sh # Health monitoring
Configuration Hierarchy
- Environment Variables: Runtime configuration (API keys, ports)
- Config Files: Persistent configuration (config.json)
- Code Defaults: Fallback values in source code
- Feature Flags: Optional component enabling
This file structure provides a solid foundation for maintaining, extending, and deploying the chatbot evaluation system while supporting code quality and performance.