File Structure
Chatbot Evaluation System - File Structure and Organization
Project Root Structure
chatbot_eval/
├── 📁 chatbot architecture/ # This documentation folder
│ ├── business_overview.md # Business context and value proposition
│ ├── system_architecture.md # High-level system design
│ ├── extractive_qa_models.md # Extractive QA model details
│ ├── generative_chat_models.md # API and generative model details
│ ├── local_chat_models.md # Local model implementation
│ ├── technical_implementation.md # Code architecture and patterns
│ ├── file_structure.md # This file - project organization
│ └── deployment_guide.md # Deployment and operations
├── 📁 templates/ # HTML templates for web UI
│ ├── index.html # Main evaluation interface
│ └── test.html # Testing and debugging interface
├── 📁 static/ # Frontend assets
│ ├── style.css # Main stylesheet
│ └── script.js # Frontend JavaScript logic
├── 📁 models/ # Model implementations and providers
│ ├── __init__.py # Package initialization
│ ├── qa.py # Model registry and provider wrappers
│ └── loader.py # Model loading utilities
├── 📁 metrics/ # Evaluation metrics implementations
│ ├── __init__.py # Package initialization
│ ├── ngram.py # BLEU, ROUGE-L, METEOR metrics
│ ├── semsim.py # Semantic similarity computation
│ ├── diversity.py # Text diversity metrics
│ ├── bert_score.py # BERT-based scoring
│ ├── bleurt.py # BLEURT evaluation metric
│ ├── comet.py # COMET metric implementation
│ ├── toxicity.py # Toxicity detection
│ └── latency.py # Performance measurement
├── 📁 utils/ # Utility functions and helpers
│ ├── __init__.py # Package initialization
│ ├── config.py # Configuration management
│ └── aggregator.py # Results aggregation utilities
├── 📁 tests/ # Test files and validation
│ ├── test_metrics.py # Metrics testing
│ ├── test_openrouter.py # OpenRouter integration tests
│ └── test_runner.py # Core evaluation tests
├── 📁 examples/ # Example data and configurations
│ └── sample.json # Sample evaluation data
├── 📁 __pycache__/ # Python bytecode cache
├── app.py # Main Flask application
├── runner.py # Core evaluation engine
├── cli.py # Command-line interface
├── test.py # Testing utilities
├── verify_install.py # Installation verification
├── requirements.txt # Python dependencies
├── config.json # Runtime configuration
├── config.example.json # Configuration template
├── README.md # Project documentation
└── summary.md # Comprehensive project summary
Detailed File Organization
1. Core Application Files
app.py - Main Flask Application
Purpose: Web server setup, route handling, and application lifecycle management
Key Components:
- Flask application configuration and middleware
- Route handlers for /, /evaluate, /health, /health/models
- Model preloading and cache warming
- Error handling and logging setup
Code Organization:
# Section 1: Imports and Setup
# Section 2: Flask Application Configuration
# Section 3: Route Handlers (/, /evaluate, /health, /health/models)
# Section 4: Utility Functions (caching, warming, etc.)
# Section 5: Main Execution Block
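The route layout above can be pictured with a minimal Flask sketch. The handler bodies, helper names, and the delegation to runner.py shown here are illustrative assumptions, not the shipped implementation:

```python
# Minimal sketch of the route layout described above (handler bodies are assumptions).
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Serves the main evaluation interface from templates/index.html
    return render_template("index.html")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    payload = request.get_json(force=True)
    # results = generate_and_evaluate(**payload)  # delegated to runner.py
    return jsonify({"status": "ok", "received": payload})

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

@app.route("/health/models")
def health_models():
    # Would report which models are preloaded and reachable
    return jsonify({"models": []})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```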
runner.py - Evaluation Engine
Purpose: Core logic for chatbot evaluation, model inference, and metrics computation
Key Components:
- Question processing and small talk detection
- Context augmentation with knowledge bases
- Model selection and inference orchestration
- Response post-processing and sanitization
- Metrics computation coordination
- Fallback mechanisms for reliability
Code Organization:
# Section 1: Imports and Constants
# Section 2: Utility Functions (sanitization, normalization)
# Section 3: Core Evaluation Function (generate_and_evaluate)
# Section 4: Small Talk and Context Management
# Section 5: Model Inference Pipeline
# Section 6: Fallback Mechanisms
# Section 7: Metrics Computation
# Section 8: Batch Evaluation Function
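A hypothetical signature for the core entry point shows how these sections fit together; the parameter names and default values below are assumptions and may differ from the actual runner.py:

```python
# Hypothetical outline of the evaluation pipeline; names are illustrative only.
from typing import Any, Dict, List, Optional

def generate_and_evaluate(
    question: str,
    references: Optional[List[str]] = None,
    model_id: str = "some-supported-model",
    options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # 1. Detect small talk and short-circuit with a canned reply if appropriate.
    # 2. Augment the question with knowledge-base context.
    # 3. Run inference through the selected provider wrapper (models/qa.py).
    # 4. Sanitize and normalize the raw response.
    # 5. Compute the configured metrics against the references.
    # 6. Fall back to an alternative model or default answer on failure.
    return {"answer": "", "metrics": {}}
```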
2. Model Layer (models/)
models/qa.py - Model Registry and Providers
Purpose: Unified interface for all model providers and model management
Key Components:
- Model registry (SUPPORTED_MODELS)
- Provider-specific wrapper functions
- Model preloading and caching logic
- API client initialization and management
Code Organization:
# Section 1: Imports and Model Registry
# Section 2: Provider Wrapper Functions
# ├── OpenAI Wrapper (_build_openai_wrapper)
# ├── Gemini Wrapper (_build_gemini_wrapper)
# ├── OpenRouter Wrapper (_build_openrouter_wrapper)
# ├── Hugging Face QA Wrapper (_build_hf_qa_wrapper)
# └── Local Chat Wrapper (_build_hf_chat_wrapper)
# Section 3: Model Preloading (preload_qa_models)
# Section 4: Pipeline Retrieval (get_qa_pipeline)
Model Categories:
1. Extractive QA Models: Direct answer extraction from context
2. API-based Chat Models: External generative models via APIs
3. Local Chat Models: Self-hosted generative models
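The provider wrappers listed above all return a callable with a common signature. The registry entries and wrapper body below are an illustrative sketch under that assumption, not the definitions shipped in models/qa.py:

```python
# Illustrative provider-wrapper pattern; registry entries and body are assumptions.
SUPPORTED_MODELS = {
    "gpt-4o-mini": {"provider": "openai", "type": "chat"},
    "deepset/roberta-base-squad2": {"provider": "huggingface", "type": "extractive_qa"},
}

def _build_openai_wrapper(model_name: str):
    """Return a callable(question, context) -> answer backed by the OpenAI API."""
    from openai import OpenAI  # openai>=1.35.0 per requirements.txt
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question: str, context: str = "") -> str:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "Answer using the provided context."},
                {"role": "user", "content": f"{context}\n\n{question}".strip()},
            ],
        )
        return response.choices[0].message.content or ""

    return ask
```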
models/loader.py - Model Loading Utilities
Purpose: Efficient model loading and device management
Key Components:
- Lazy loading implementation for memory efficiency
- Device selection and management (CPU/GPU)
- Model caching and memory optimization
Key Functions:
- load_transformer_model_for_generation(): Load generative models
- set_device_policy(): Configure compute device preferences
- get_device(): Get current compute device
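A minimal sketch of these three functions, assuming a module-level device override and an lru_cache for model reuse; the real caching strategy in loader.py may differ:

```python
# Sketch of lazy loading with a device policy; the lru_cache here is an assumption.
from functools import lru_cache
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_DEVICE_OVERRIDE: Optional[str] = None

def set_device_policy(device: Optional[str]) -> None:
    """Force "cpu"/"cuda", or pass None to auto-detect."""
    global _DEVICE_OVERRIDE
    _DEVICE_OVERRIDE = device

def get_device() -> str:
    if _DEVICE_OVERRIDE:
        return _DEVICE_OVERRIDE
    return "cuda" if torch.cuda.is_available() else "cpu"

@lru_cache(maxsize=4)
def load_transformer_model_for_generation(model_name: str):
    """Lazily load and cache a generative model on the selected device."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(get_device())
    return tokenizer, model
```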
3. Metrics Layer (metrics/)
Metrics Package Structure
Each metric module follows a consistent pattern:
from typing import Dict, List

def compute(predictions: List[str], references: List[List[str]], options: Dict) -> Dict:
    """
    Compute metric scores for given predictions and references.

    Args:
        predictions: List of predicted responses
        references: List of reference response lists
        options: Computation options (device, model_type, etc.)

    Returns:
        Dict with 'aggregate' scores and 'per_item' breakdowns
    """
metrics/ngram.py - BLEU/ROUGE/METEOR
Purpose: Traditional n-gram based evaluation metrics
Key Metrics:
- BLEU (Bilingual Evaluation Understudy)
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)
- METEOR (Metric for Evaluation of Translation with Explicit Ordering)
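These scores could be obtained with the evaluate package pinned in requirements.txt; the snippet below is a hedged sketch of that approach, and the shipped ngram.py may wrap the loaders differently or call sacrebleu/rouge-score directly:

```python
# Sketch using the `evaluate` package (assumed loader names: sacrebleu, rouge, meteor).
import evaluate

predictions = ["Rest and drink plenty of fluids."]
references = [["Rest and plenty of fluids are recommended."]]

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=[r[0] for r in references])["rougeL"])
print(meteor.compute(predictions=predictions, references=[r[0] for r in references])["meteor"])
```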
metrics/semsim.py - Semantic Similarity
Purpose: Deep learning based semantic evaluation
Implementation:
- Sentence Transformers for embedding computation
- Cosine similarity for semantic matching
- Contextual understanding of response quality
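A minimal sketch of the embedding-plus-cosine-similarity approach; the checkpoint name is an assumption, and semsim.py may use a different Sentence Transformers model:

```python
# Encode prediction and reference, then compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

prediction = "Drink plenty of water and rest."
reference = "Staying hydrated and resting is advised."

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine in [-1, 1]
print(round(similarity, 3))
```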
metrics/bert_score.py - BERT-based Scoring
Purpose: Advanced semantic evaluation using BERT
Features:
- Contextual token-level similarity
- Precision, recall, and F1 scoring
- Support for multiple BERT model variants
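A hedged sketch using the bert-score package from requirements.txt; the model variant and rescaling options chosen in bert_score.py may differ:

```python
# BERTScore returns per-pair precision, recall, and F1 tensors.
from bert_score import score

preds = ["Take the medication twice daily."]
refs = ["The medicine should be taken two times a day."]

precision, recall, f1 = score(preds, refs, lang="en")
print(f1.mean().item())
```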
metrics/bleurt.py - BLEURT Metric
Purpose: Neural evaluation metric trained on human judgments
Features:
- Learned evaluation criteria
- Human-like quality assessment
- Requires a large checkpoint (optional dependency)
metrics/comet.py - COMET Evaluation
Purpose: Reference-free evaluation metric
Features:
- Quality estimation without references
- Multi-dimensional scoring
- Cross-lingual capabilities
metrics/toxicity.py - Toxicity Detection
Purpose: Safety and content quality assessment
Implementation:
- Perspective API integration
- Toxicity probability scoring
- Configurable thresholds
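An illustrative Perspective API call; the endpoint and payload follow Google's public API documentation, while the key handling (PERSPECTIVE_API_KEY) and threshold logic here are assumptions:

```python
# Query Perspective API for a TOXICITY probability in [0, 1].
import os

import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    response = requests.post(
        PERSPECTIVE_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},  # assumed env var name
        json={"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```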
metrics/latency.py - Performance Metrics
Purpose: Response time and throughput measurement
Metrics:
- End-to-end latency
- Model inference time
- Metrics computation time
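A simple timing pattern of the kind latency.py is likely to use (an assumption, not the shipped helper):

```python
# Wrap any callable and report its wall-clock duration.
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```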
4. Utilities Layer (utils/)
utils/config.py - Configuration Management
Purpose: Centralized configuration and API key management
Key Functions:
- get_api_key(): Retrieve API keys from environment or config files
- get_openrouter_headers(): Generate OpenRouter-specific headers
- validate_configuration(): Validate system configuration
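A hedged sketch of the lookup order (environment variables first, then config.json); the key naming and validation in the real config.py may be richer:

```python
# Resolve an API key from the environment, falling back to config.json.
import json
import os
from typing import Optional

def get_api_key(name: str, config_path: str = "config.json") -> Optional[str]:
    env_value = os.environ.get(name.upper())
    if env_value:
        return env_value
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as handle:
            return json.load(handle).get(name.lower())
    return None

# Example: get_api_key("openai_api_key") checks OPENAI_API_KEY, then config.json.
```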
utils/aggregator.py - Results Aggregation
Purpose: Combine and format evaluation results
Key Functions:
- aggregate_metrics(): Combine multiple metric results
- format_results(): Prepare results for API response
- calculate_confidence_intervals(): Statistical analysis of results
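A sketch of how per-metric results could be merged into one payload; the real aggregator.py may also attach confidence intervals and response formatting:

```python
# Combine {metric_name: {"aggregate": ..., "per_item": ...}} dicts into one result.
from typing import Dict

def aggregate_metrics(results: Dict[str, Dict]) -> Dict:
    return {
        "aggregate": {name: r.get("aggregate") for name, r in results.items()},
        "per_item": {name: r.get("per_item") for name, r in results.items()},
    }
```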
5. Frontend Layer (templates/, static/)
templates/index.html - Main Interface
Purpose: User interface for chatbot evaluation
Components:
- Model selection dropdown
- Question input with hints
- Advanced hyperparameter configuration
- Results display area
- Provider debugging interface
static/script.js - Frontend Logic
Purpose: Client-side evaluation and UI management
Key Functions:
- Form submission and API communication
- Results rendering and visualization
- Model metadata display
- Error handling and user feedback
static/style.css - Styling
Purpose: Modern, responsive UI design
Features:
- Clean, professional appearance
- Mobile-responsive design
- Dark/light theme support
- Accessible color schemes
6. Configuration and Infrastructure
config.json - Runtime Configuration
{
  "openai_api_key": "sk-...",
  "gemini_api_key": "AIza...",
  "openrouter_api_key": "sk-or-...",
  "openrouter_site_url": "https://your-site.example",
  "openrouter_site_name": "Your App Name"
}
requirements.txt - Dependencies
Core Dependencies:
- flask>=3.0.0: Web framework
- transformers>=4.42.0: Hugging Face models
- torch>=2.1.0: Deep learning framework
- openai>=1.35.0: OpenAI API client
- google-generativeai>=0.7.0: Gemini API client
Evaluation Dependencies:
- evaluate>=0.4.1: Metrics computation framework
- sacrebleu>=2.4.0: BLEU metric implementation
- rouge-score>=0.1.2: ROUGE metric implementation
- bert-score>=0.3.13: BERTScore metric
- sentence-transformers>=2.2.2: Semantic similarity
7. Testing and Validation (tests/, examples/)
Test Files:
- test_metrics.py: Validate metric computations
- test_openrouter.py: Test OpenRouter integration
- test_runner.py: Test core evaluation functionality
Example Files:
- sample.json: Sample evaluation data format
- config.example.json: Configuration template
8. Documentation and Deployment
README.md - Project Documentation
Sections:
- Feature overview and installation
- Configuration and API key setup
- Usage examples and troubleshooting
- Model support and performance notes
verify_install.py - Installation Verification
Purpose: Validate system setup and functionality
Tests:
- Model loading verification
- API connectivity testing
- Metrics computation validation
- End-to-end evaluation testing
cli.py - Command Line Interface
Purpose: Batch evaluation and scripting support
Features:
- File-based evaluation
- CSV export functionality
- Batch processing capabilities
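A hypothetical CLI shape illustrating these features; the flag names and delegated helper are assumptions, not the shipped interface:

```python
# Sketch of a batch-evaluation CLI; flag names are illustrative.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Batch chatbot evaluation")
    parser.add_argument("--input", required=True, help="JSON file of questions and references")
    parser.add_argument("--model", default="gpt-4o-mini", help="Model id from SUPPORTED_MODELS")
    parser.add_argument("--csv", help="Optional path for CSV export of results")
    args = parser.parse_args()
    # evaluate_batch(args.input, args.model, csv_path=args.csv)  # delegated to runner.py

if __name__ == "__main__":
    main()
```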
Development Workflow
Code Organization Principles
- Separation of Concerns: Each module has a single responsibility
- Consistent Interfaces: All model providers implement the same interface
- Error Handling: Comprehensive error handling with graceful degradation
- Configuration Management: Environment-based configuration with validation
- Performance Optimization: Lazy loading and caching for efficiency
File Naming Conventions
- Python modules: snake_case.py (e.g., qa.py, semsim.py)
- Classes: PascalCase (e.g., OpenAIProvider, MetricsEngine)
- Functions: snake_case (e.g., compute_metrics, load_model)
- Constants: UPPER_CASE (e.g., SUPPORTED_MODELS, MEDICAL_SYSTEM_PROMPT)
Import Structure
Clean Import Organization:
# Standard library imports first
import os
import logging
from typing import Dict, List, Any
# Third-party imports
from flask import Flask, jsonify
from transformers import pipeline
# Local imports last
from models.qa import SUPPORTED_MODELS
from utils.config import get_api_key
Deployment Organization
Production Deployment Structure
deployment/
├── docker/
│ ├── Dockerfile # Main application container
│ ├── docker-compose.yml # Multi-service orchestration
│ └── nginx.conf # Reverse proxy configuration
├── kubernetes/
│ ├── deployment.yml # Kubernetes deployment
│ ├── service.yml # Service definition
│ └── ingress.yml # Ingress configuration
├── monitoring/
│ ├── prometheus.yml # Metrics collection
│ └── grafana/ # Dashboards
└── scripts/
├── deploy.sh # Deployment automation
└── health_check.sh # Health monitoring
Configuration Hierarchy
- Environment Variables: Runtime configuration (API keys, ports)
- Config Files: Persistent configuration (config.json)
- Code Defaults: Fallback values in source code
- Feature Flags: Optional component enabling
This file structure provides a solid foundation for maintaining, extending, and deploying the chatbot evaluation system while supporting code quality and performance.