Chatbot Evaluation System - System Architecture Overview
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│                      USER INTERFACE LAYER                       │
├─────────────────────────────────────────────────────────────────┤
│  Web Browser  ←→  Flask Web Server  ←→  Model Providers         │
│  (HTML/CSS/JS)    (REST API)            (OpenAI, Gemini, HF)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                     APPLICATION LOGIC LAYER                     │
├─────────────────────────────────────────────────────────────────┤
│  Request Processing  │ Context Management │ Model Selection     │
│  Response Generation │ Metric Computation │ Result Formatting   │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                      DATA PROCESSING LAYER                      │
├─────────────────────────────────────────────────────────────────┤
│ Text Processing      │ Model Inference       │ Metric Calculation │
│ Context Augmentation │ Response Sanitization │ Quality Analysis   │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                      INFRASTRUCTURE LAYER                       │
├─────────────────────────────────────────────────────────────────┤
│  Model Caching  │ Configuration Mgmt  │ Logging & Monitoring    │
│  Error Handling │ Resource Management │ Health Checks           │
└─────────────────────────────────────────────────────────────────┘
Layered Architecture Breakdown
1. User Interface Layer
Frontend Components:
- Web Interface (templates/index.html): Main evaluation dashboard
- Static Assets (static/style.css, static/script.js): Styling and client-side logic
- API Client: JavaScript code for server communication
Key Features:
- Model selection dropdown with provider badges
- Question input with contextual hints
- Advanced hyperparameter configuration
- Real-time results display with metrics visualization
- Provider request/response debugging interface
2. Application Logic Layer
Core Application (app.py):
# Flask application with modular route handlers
@app.route("/evaluate", methods=["POST"])
def evaluate_route():
    # Request validation and preprocessing
    # Model pipeline orchestration
    # Response formatting and error handling
    ...
Evaluation Engine (runner.py):
def generate_and_evaluate(question, context, hyperparams, model_name):
    # Small talk detection and routing
    # Context augmentation
    # Model selection and inference
    # Response post-processing
    # Metrics computation
    # Result aggregation
    ...
Key Responsibilities:
- Request validation and sanitization
- Model pipeline orchestration
- Context management and augmentation
- Error handling and fallback mechanisms
- Response formatting and delivery
3. Data Processing Layer
Model Interface (models/qa.py):
- Unified interface across all model providers
- Provider-specific wrapper implementations
- Model loading and caching strategies
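To make the loading-and-caching strategy concrete, here is a minimal sketch of a lazy-loading registry in the spirit of models/qa.py; the loader table and model name are illustrative assumptions, not the project's actual configuration.

from functools import lru_cache

from transformers import pipeline

# Illustrative loader table; the real entries live in models/qa.py.
_LOADERS = {
    "distilbert-qa": lambda: pipeline(
        "question-answering",
        model="distilbert-base-cased-distilled-squad",
    ),
}

@lru_cache(maxsize=None)
def get_pipeline(model_name):
    """Load a pipeline on first use and cache it for subsequent requests."""
    if model_name not in _LOADERS:
        raise ValueError(f"Unknown model: {model_name}")
    return _LOADERS[model_name]()

Because the cache is keyed by model name, repeated requests share a single in-memory instance, which keeps per-request latency flat after the first load.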
Metrics Engine (metrics/):
- Text similarity computation (BLEU, ROUGE-L)
- Semantic analysis (sentence transformers)
- Quality assessment (BERTScore, BLEURT, COMET)
- Safety evaluation (toxicity detection)
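As a sketch of how these metrics can be computed, the snippet below uses the Hugging Face evaluate library for the n-gram scores and sentence-transformers for embedding similarity; the embedding model name is an illustrative choice.

import evaluate
from sentence_transformers import SentenceTransformer, util

prediction = ["Drink fluids and rest to help recover from a cold."]
reference = ["Rest and drink plenty of fluids to recover from a cold."]

# N-gram overlap metrics via the `evaluate` library.
bleu = evaluate.load("bleu").compute(predictions=prediction, references=[reference])
rouge = evaluate.load("rouge").compute(predictions=prediction, references=reference)

# Semantic similarity via sentence embeddings (model choice is illustrative).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_pred, emb_ref = encoder.encode([prediction[0], reference[0]])
semantic = util.cos_sim(emb_pred, emb_ref).item()

print(bleu["bleu"], rouge["rougeL"], semantic)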
Processing Pipeline:
Input Question → Text Preprocessing → Model Selection → Inference →
Response Post-processing → Metrics Calculation → Quality Validation →
Formatted Results
4. Infrastructure Layer
Caching System:
- Hugging Face model cache management
- Preloading strategies for performance
- Memory-efficient model loading
Configuration Management:
- Environment-based configuration
- API key management and security
- Feature flag system for optional components
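A minimal sketch of environment-based configuration with a feature flag; the variable names are assumptions rather than the project's actual settings.

import os

def env_flag(name, default="0"):
    """Read a boolean feature flag from the environment."""
    return os.environ.get(name, default).lower() in ("1", "true", "yes")

# Variable names below are illustrative assumptions.
HF_CACHE_DIR = os.environ.get("HF_CACHE_DIR", "./hf_cache")
PRELOAD_HEAVY_METRICS = env_flag("PRELOAD_HEAVY_METRICS")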
Monitoring and Logging:
- Structured logging with request tracing
- Performance monitoring and alerting
- Health check endpoints for operational monitoring
Component Architecture
Model Provider Abstraction
Design Pattern: Adapter Pattern
class ModelProvider:
    def __init__(self, config):
        self.config = config
        self.client = self._initialize_client()

    def generate_response(self, question, context, **kwargs):
        # Provider-specific implementation
        pass

    def _initialize_client(self):
        # Provider-specific client setup
        pass
Supported Providers:
1. Hugging Face (Local): Direct model loading and inference
2. OpenAI (API): HTTP client with API key authentication
3. Google Gemini (API): SDK-based client with authentication
4. OpenRouter (API): Unified API for multiple model providers
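As one concrete adapter, an OpenAI implementation of the interface above might look like the sketch below; it relies on the official openai SDK, which reads OPENAI_API_KEY from the environment, and the default model name is only an example.

from openai import OpenAI

class OpenAIProvider(ModelProvider):
    """Adapter wrapping the OpenAI chat completions API."""

    def _initialize_client(self):
        # Picks up OPENAI_API_KEY from the environment by default.
        return OpenAI()

    def generate_response(self, question, context, **kwargs):
        response = self.client.chat.completions.create(
            model=self.config.get("model", "gpt-4o-mini"),  # example default
            messages=[
                {"role": "system", "content": f"Context: {context}"},
                {"role": "user", "content": question},
            ],
            temperature=kwargs.get("temperature", 0.2),
        )
        return response.choices[0].message.content

Because every adapter exposes the same generate_response signature, the evaluation engine can switch providers without branching in the request path.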
Metrics Computation Architecture
Design Pattern: Strategy Pattern
class MetricsEngine:
    def __init__(self):
        self.strategies = self._load_strategies()

    def compute_metrics(self, predictions, references, options):
        results = {}
        for metric_name in options.get('metrics', []):
            strategy = self.strategies.get(metric_name)
            if strategy:
                results[metric_name] = strategy.compute(predictions, references)
        return results
Metrics Categories:
1. N-gram Based: BLEU, ROUGE-L, METEOR (precision, recall, F1)
2. Semantic: Sentence transformers, BERTScore (contextual similarity)
3. Quality: BLEURT, COMET (neural evaluation metrics)
4. Safety: Toxicity detection (Perspective API)
5. Performance: Latency, throughput measurements
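An individual strategy can stay very small. The sketch below fills the BLEU slot using the evaluate library and shows where it would be registered; the class itself is illustrative.

import evaluate

class BleuStrategy:
    """Strategy computing corpus BLEU via the `evaluate` library."""

    def __init__(self):
        self._metric = evaluate.load("bleu")

    def compute(self, predictions, references):
        result = self._metric.compute(
            predictions=predictions,
            references=[[ref] for ref in references],
        )
        return result["bleu"]

# Hypothetical registration inside MetricsEngine._load_strategies:
#   return {"bleu": BleuStrategy(), ...}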
Context Management System
Context Augmentation Pipeline:
class ContextManager:
    def __init__(self):
        self.generic_context = self._load_generic_context()
        self.symptom_kb = self._load_symptom_knowledge_base()

    def augment_context(self, question, base_context):
        # Detect question type and intent
        # Apply relevant context augmentation
        # Merge with base context
        return augmented_context
Knowledge Sources:
1. Generic Healthcare Context: Pre-loaded medical knowledge
2. Symptom-Specific KB: Targeted information for common symptoms
3. Dynamic Context: Real-time information retrieval (future enhancement)
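A minimal keyword-driven version of the augmentation step could look like the following; the knowledge-base entries are invented for illustration.

# Illustrative symptom knowledge base; entries are invented examples.
SYMPTOM_KB = {
    "headache": "Common causes include dehydration, stress, and eye strain.",
    "fever": "A temperature above 38°C; seek care if it lasts over 3 days.",
}

def augment_context(question, base_context):
    """Append symptom-specific snippets when keywords appear in the question."""
    extras = [
        snippet
        for keyword, snippet in SYMPTOM_KB.items()
        if keyword in question.lower()
    ]
    return " ".join([base_context, *extras]).strip()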
Data Flow Architecture
Evaluation Request Flow
1. User submits question via web UI
↓
2. Frontend validates input and sends POST /evaluate
↓
3. Flask route handler validates request payload
↓
4. Small talk detection (fast path for simple queries)
↓
5. Context augmentation based on question analysis
↓
6. Model selection and pipeline retrieval
↓
7. Model inference with error handling and fallbacks
↓
8. Response sanitization and post-processing
↓
9. Parallel metrics computation across 8 metric types
↓
10. Results aggregation and formatting
↓
11. JSON response with comprehensive evaluation data
↓
12. Frontend renders results with visualizations
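Steps 3-11 condense into a route handler along these lines; the size limit and the small-talk detector are illustrative stand-ins, while generate_and_evaluate is the runner.py entry point described earlier.

from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_QUESTION_CHARS = 2000  # illustrative size limit
GREETINGS = {"hi", "hello", "hey", "thanks"}

def is_small_talk(question):
    """Rough stand-in for the real small-talk detector."""
    return question.lower().strip("!?. ") in GREETINGS

@app.route("/evaluate", methods=["POST"])
def evaluate_route():
    payload = request.get_json(silent=True) or {}
    question = (payload.get("question") or "").strip()
    if not question or len(question) > MAX_QUESTION_CHARS:
        return jsonify({"error": "question is missing or too long"}), 400

    # Fast path: skip inference and metrics entirely for small talk.
    if is_small_talk(question):
        return jsonify({"answer": "Hello! Ask me a health question."})

    result = generate_and_evaluate(
        question=question,
        context=payload.get("context", ""),
        hyperparams=payload.get("hyperparams", {}),
        model_name=payload.get("model", "default"),
    )
    return jsonify(result)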
Model Loading Flow
1. Application startup triggers model preloading
↓
2. Extractive QA models loaded immediately (tokenizer + model)
↓
3. API-based model wrappers registered (lazy loading)
↓
4. Local chat models register lazy-loading wrappers
↓
5. Sentence transformers warmed for semantic similarity
↓
6. Optional heavy metrics preloaded based on environment flags
↓
7. Health check endpoint validates all components
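At startup, this sequence might reduce to a preload hook like the sketch below, with eager loading for extractive QA and flag-gated preloading for heavy metrics; the registry shape and model name are assumptions.

import os

from transformers import pipeline

def preload_models(registry):
    """Eagerly load extractive QA; gate heavy metrics behind a flag."""
    # Extractive QA pipelines load immediately (tokenizer + model).
    registry["extractive-qa"] = pipeline(
        "question-answering",
        model="distilbert-base-cased-distilled-squad",
    )

    # Heavy optional metrics load only when explicitly enabled.
    if os.environ.get("PRELOAD_HEAVY_METRICS") == "1":
        import evaluate
        registry["bertscore"] = evaluate.load("bertscore")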
Error Handling Flow
1. Model inference error detected
↓
2. Primary model marked as failed
↓
3. Fallback model selected from available pipelines
↓
4. Retry logic with exponential backoff for API models
↓
5. LLM fallback for low-confidence QA responses
↓
6. Graceful degradation with user-friendly error messages
↓
7. Comprehensive logging for debugging and monitoring
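The retry-then-fallback behavior can be sketched as below, assuming the primary and fallback models are passed in as callables; the delay constants are arbitrary.

import time

def infer_with_fallback(primary, fallback, question, retries=3):
    """Try the primary model with exponential backoff, then fall back."""
    delay = 1.0
    for attempt in range(retries):
        try:
            return primary(question)
        except Exception:
            if attempt == retries - 1:
                break  # retries exhausted; switch to the fallback model
            time.sleep(delay)
            delay *= 2  # exponential backoff between API retries
    return fallback(question)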
Scalability Architecture
Horizontal Scaling
- Stateless Design: No server-side session storage
- Container Ready: Docker-compatible architecture
- Load Balancing: Multiple application instances behind load balancer
Performance Scaling
- Model Caching: Preloaded models reduce cold-start latency
- Async Processing: Non-blocking metrics computation
- Resource Pooling: Shared model instances across requests
Data Scaling
- Efficient Storage: Minimal data retention for evaluation results
- Caching Strategy: In-memory caching for frequently used models
- Cleanup Policies: Automatic cache invalidation and memory management
Security Architecture
API Security
- Key Management: Secure storage of API keys in environment variables
- Request Validation: Input sanitization and size limits
- Rate Limiting: Protection against abuse and API quota exhaustion
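For illustration, a naive in-process sliding-window limiter is sketched below; a production deployment would more likely back this with a shared store such as Redis.

import time
from collections import defaultdict

class RateLimiter:
    """Naive per-client sliding-window limiter; in-memory, single process."""

    def __init__(self, max_requests=30, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(list)

    def allow(self, client_id):
        now = time.monotonic()
        recent = [t for t in self.history[client_id] if now - t < self.window]
        self.history[client_id] = recent
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True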
Data Protection
- PII Handling: No storage of sensitive user information
- Secure Headers: HTTPS enforcement and security headers
- Audit Logging: Comprehensive request logging for security monitoring
Provider Security
- Credential Isolation: Separate API keys for different providers
- Request Signing: Secure authentication with model providers
- Response Validation: Sanitization of model responses before display
Deployment Architecture
Development Environment
Local Development Setup:
├── Python virtual environment
├── Local model cache (./hf_cache)
├── Environment variable configuration
└── Hot-reload development server
Production Environment
Production Deployment:
├── Docker containerization
├── Multi-stage model loading
├── Health check integration
├── Structured logging (JSON format)
├── Metrics collection (Prometheus)
└── Distributed tracing (OpenTelemetry)
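The health check integration could be as small as an endpoint that reports whether the critical components loaded; PIPELINES and METRICS_ENGINE below are hypothetical module-level handles, not the project's actual names, and app is the Flask application from the request-flow sketch.

from flask import jsonify

@app.route("/health")
def health():
    # PIPELINES and METRICS_ENGINE are hypothetical handles to the loaded
    # model registry and the metrics engine.
    checks = {
        "models_loaded": bool(PIPELINES),
        "metrics_ready": METRICS_ENGINE is not None,
    }
    status = 200 if all(checks.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", **checks}
    return jsonify(body), status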
Configuration Management
# Production Configuration Example
deployment:
  replicas: 3
  resources:
    memory: "8Gi"
    cpu: "2000m"
models:
  preload_heavy_metrics: false
  cache_dir: "/app/hf_cache"
  max_model_memory: "6Gi"
monitoring:
  health_check_interval: "30s"
  metrics_retention: "7d"
  log_level: "INFO"
This modular, layered architecture provides a robust foundation for chatbot evaluation while maintaining flexibility for future enhancements and scaling requirements.