Chatbot Evaluation System - System Architecture Overview
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│                      USER INTERFACE LAYER                       │
├─────────────────────────────────────────────────────────────────┤
│  Web Browser  ←→  Flask Web Server  ←→  Model Providers         │
│  (HTML/CSS/JS)    (REST API)            (OpenAI, Gemini, HF)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                     APPLICATION LOGIC LAYER                     │
├─────────────────────────────────────────────────────────────────┤
│  Request Processing  │ Context Management │ Model Selection     │
│  Response Generation │ Metric Computation │ Result Formatting   │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                      DATA PROCESSING LAYER                      │
├─────────────────────────────────────────────────────────────────┤
│ Text Processing      │ Model Inference       │ Metric Calculation │
│ Context Augmentation │ Response Sanitization │ Quality Analysis   │
└─────────────────────────────────────────────────────────────────┘
                                 │
┌─────────────────────────────────────────────────────────────────┐
│                      INFRASTRUCTURE LAYER                       │
├─────────────────────────────────────────────────────────────────┤
│  Model Caching  │ Configuration Mgmt  │ Logging & Monitoring    │
│  Error Handling │ Resource Management │ Health Checks           │
└─────────────────────────────────────────────────────────────────┘
Layered Architecture Breakdown
1. User Interface Layer
Frontend Components:
- Web Interface (templates/index.html): Main evaluation dashboard
- Static Assets (static/style.css, static/script.js): Styling and client-side logic
- API Client: JavaScript code for server communication
Key Features:
- Model selection dropdown with provider badges
- Question input with contextual hints
- Advanced hyperparameter configuration
- Real-time results display with metrics visualization
- Provider request/response debugging interface
2. Application Logic Layer
Core Application (app.py):
# Flask application with modular route handlers
@app.route("/evaluate", methods=["POST"])
def evaluate_route():
    # Request validation and preprocessing
    # Model pipeline orchestration
    # Response formatting and error handling
    ...
Evaluation Engine (runner.py):
def generate_and_evaluate(question, context, hyperparams, model_name):
    # Small talk detection and routing
    # Context augmentation
    # Model selection and inference
    # Response post-processing
    # Metrics computation
    # Result aggregation
    ...
Key Responsibilities:
- Request validation and sanitization
- Model pipeline orchestration
- Context management and augmentation
- Error handling and fallback mechanisms
- Response formatting and delivery
3. Data Processing Layer
Model Interface (models/qa.py):
- Unified interface across all model providers
- Provider-specific wrapper implementations
- Model loading and caching strategies
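To make the loading-and-caching strategy concrete, here is a minimal sketch of a lazy-loading registry in the spirit of models/qa.py; the loader table and model name are illustrative assumptions, not the project's actual configuration.

from functools import lru_cache

from transformers import pipeline

# Illustrative loader table; the real entries live in models/qa.py.
_LOADERS = {
    "distilbert-qa": lambda: pipeline(
        "question-answering",
        model="distilbert-base-cased-distilled-squad",
    ),
}

@lru_cache(maxsize=None)
def get_pipeline(model_name):
    """Load a pipeline on first use and cache it for subsequent requests."""
    if model_name not in _LOADERS:
        raise ValueError(f"Unknown model: {model_name}")
    return _LOADERS[model_name]()

Because the cache is keyed by model name, repeated requests share a single in-memory instance, which keeps per-request latency flat after the first load.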
Metrics Engine (metrics/):
- Text similarity computation (BLEU, ROUGE-L)
- Semantic analysis (sentence transformers)
- Quality assessment (BERTScore, BLEURT, COMET)
- Safety evaluation (toxicity detection)
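As a sketch of how these metrics can be computed, the snippet below uses the Hugging Face evaluate library for the n-gram scores and sentence-transformers for embedding similarity; the embedding model name is an illustrative choice.

import evaluate
from sentence_transformers import SentenceTransformer, util

prediction = ["Drink fluids and rest to help recover from a cold."]
reference = ["Rest and drink plenty of fluids to recover from a cold."]

# N-gram overlap metrics via the `evaluate` library.
bleu = evaluate.load("bleu").compute(predictions=prediction, references=[reference])
rouge = evaluate.load("rouge").compute(predictions=prediction, references=reference)

# Semantic similarity via sentence embeddings (model choice is illustrative).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_pred, emb_ref = encoder.encode([prediction[0], reference[0]])
semantic = util.cos_sim(emb_pred, emb_ref).item()

print(bleu["bleu"], rouge["rougeL"], semantic)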
Processing Pipeline:
Input Question → Text Preprocessing → Model Selection → Inference →
Response Post-processing → Metrics Calculation → Quality Validation →
Formatted Results
4. Infrastructure Layer
Caching System:
- Hugging Face model cache management
- Preloading strategies for performance
- Memory-efficient model loading
Configuration Management:
- Environment-based configuration
- API key management and security
- Feature flag system for optional components
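A minimal sketch of environment-based configuration with a feature flag; the variable names are assumptions rather than the project's actual settings.

import os

def env_flag(name, default="0"):
    """Read a boolean feature flag from the environment."""
    return os.environ.get(name, default).lower() in ("1", "true", "yes")

# Variable names below are illustrative assumptions.
HF_CACHE_DIR = os.environ.get("HF_CACHE_DIR", "./hf_cache")
PRELOAD_HEAVY_METRICS = env_flag("PRELOAD_HEAVY_METRICS")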
Monitoring and Logging:
- Structured logging with request tracing
- Performance monitoring and alerting
- Health check endpoints for operational monitoring
Component Architecture
Model Provider Abstraction
Design Pattern: Adapter Pattern
class ModelProvider:
    def __init__(self, config):
        self.config = config
        self.client = self._initialize_client()

    def generate_response(self, question, context, **kwargs):
        # Provider-specific implementation
        pass

    def _initialize_client(self):
        # Provider-specific client setup
        pass
Supported Providers:
1. Hugging Face (Local): Direct model loading and inference
2. OpenAI (API): HTTP client with API key authentication
3. Google Gemini (API): SDK-based client with authentication
4. OpenRouter (API): Unified API for multiple model providers
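As one concrete adapter, an OpenAI implementation of the interface above might look like the sketch below; it relies on the official openai SDK, which reads OPENAI_API_KEY from the environment, and the default model name is only an example.

from openai import OpenAI

class OpenAIProvider(ModelProvider):
    """Adapter wrapping the OpenAI chat completions API."""

    def _initialize_client(self):
        # Picks up OPENAI_API_KEY from the environment by default.
        return OpenAI()

    def generate_response(self, question, context, **kwargs):
        response = self.client.chat.completions.create(
            model=self.config.get("model", "gpt-4o-mini"),  # example default
            messages=[
                {"role": "system", "content": f"Context: {context}"},
                {"role": "user", "content": question},
            ],
            temperature=kwargs.get("temperature", 0.2),
        )
        return response.choices[0].message.content

Because every adapter exposes the same generate_response signature, the evaluation engine can switch providers without branching in the request path.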
Metrics Computation Architecture
Design Pattern: Strategy Pattern
class MetricsEngine:
    def __init__(self):
        self.strategies = self._load_strategies()

    def compute_metrics(self, predictions, references, options):
        results = {}
        for metric_name in options.get('metrics', []):
            strategy = self.strategies.get(metric_name)
            if strategy:
                results[metric_name] = strategy.compute(predictions, references)
        return results
Metrics Categories:
1. N-gram Based: BLEU, ROUGE-L, METEOR (precision, recall, F1)
2. Semantic: Sentence transformers, BERTScore (contextual similarity)
3. Quality: BLEURT, COMET (neural evaluation metrics)
4. Safety: Toxicity detection (Perspective API)
5. Performance: Latency, throughput measurements
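An individual strategy can stay very small. The sketch below fills the BLEU slot using the evaluate library and shows where it would be registered; the class itself is illustrative.

import evaluate

class BleuStrategy:
    """Strategy computing corpus BLEU via the `evaluate` library."""

    def __init__(self):
        self._metric = evaluate.load("bleu")

    def compute(self, predictions, references):
        result = self._metric.compute(
            predictions=predictions,
            references=[[ref] for ref in references],
        )
        return result["bleu"]

# Hypothetical registration inside MetricsEngine._load_strategies:
#   return {"bleu": BleuStrategy(), ...}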
Context Management System
Context Augmentation Pipeline:
class ContextManager:
    def __init__(self):
        self.generic_context = self._load_generic_context()
        self.symptom_kb = self._load_symptom_knowledge_base()

    def augment_context(self, question, base_context):
        # Detect question type and intent
        # Apply relevant context augmentation
        # Merge with base context
        return augmented_context
Knowledge Sources:
1. Generic Healthcare Context: Pre-loaded medical knowledge
2. Symptom-Specific KB: Targeted information for common symptoms
3. Dynamic Context: Real-time information retrieval (future enhancement)
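A minimal keyword-driven version of the augmentation step could look like the following; the knowledge-base entries are invented for illustration.

# Illustrative symptom knowledge base; entries are invented examples.
SYMPTOM_KB = {
    "headache": "Common causes include dehydration, stress, and eye strain.",
    "fever": "A temperature above 38°C; seek care if it lasts over 3 days.",
}

def augment_context(question, base_context):
    """Append symptom-specific snippets when keywords appear in the question."""
    extras = [
        snippet
        for keyword, snippet in SYMPTOM_KB.items()
        if keyword in question.lower()
    ]
    return " ".join([base_context, *extras]).strip()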
Data Flow Architecture
Evaluation Request Flow
1. User submits question via web UI
↓
2. Frontend validates input and sends POST /evaluate
↓
3. Flask route handler validates request payload
↓
4. Small talk detection (fast path for simple queries)
↓
5. Context augmentation based on question analysis
↓
6. Model selection and pipeline retrieval
↓
7. Model inference with error handling and fallbacks
↓
8. Response sanitization and post-processing
↓
9. Parallel metrics computation across 8 metric types
↓
10. Results aggregation and formatting
↓
11. JSON response with comprehensive evaluation data
↓
12. Frontend renders results with visualizations
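Steps 3-11 condense into a route handler along these lines; the size limit and the small-talk detector are illustrative stand-ins, while generate_and_evaluate is the runner.py entry point described earlier.

from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_QUESTION_CHARS = 2000  # illustrative size limit
GREETINGS = {"hi", "hello", "hey", "thanks"}

def is_small_talk(question):
    """Rough stand-in for the real small-talk detector."""
    return question.lower().strip("!?. ") in GREETINGS

@app.route("/evaluate", methods=["POST"])
def evaluate_route():
    payload = request.get_json(silent=True) or {}
    question = (payload.get("question") or "").strip()
    if not question or len(question) > MAX_QUESTION_CHARS:
        return jsonify({"error": "question is missing or too long"}), 400

    # Fast path: skip inference and metrics entirely for small talk.
    if is_small_talk(question):
        return jsonify({"answer": "Hello! Ask me a health question."})

    result = generate_and_evaluate(
        question=question,
        context=payload.get("context", ""),
        hyperparams=payload.get("hyperparams", {}),
        model_name=payload.get("model", "default"),
    )
    return jsonify(result)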
Model Loading Flow
1. Application startup triggers model preloading
↓
2. Extractive QA models loaded immediately (tokenizer + model)
↓
3. API-based model wrappers registered (lazy loading)
↓
4. Local chat models register lazy-loading wrappers
↓
5. Sentence transformers warmed for semantic similarity
↓
6. Optional heavy metrics preloaded based on environment flags
↓
7. Health check endpoint validates all components
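At startup, this sequence might reduce to a preload hook like the sketch below, with eager loading for extractive QA and flag-gated preloading for heavy metrics; the registry shape and model name are assumptions.

import os

from transformers import pipeline

def preload_models(registry):
    """Eagerly load extractive QA; gate heavy metrics behind a flag."""
    # Extractive QA pipelines load immediately (tokenizer + model).
    registry["extractive-qa"] = pipeline(
        "question-answering",
        model="distilbert-base-cased-distilled-squad",
    )

    # Heavy optional metrics load only when explicitly enabled.
    if os.environ.get("PRELOAD_HEAVY_METRICS") == "1":
        import evaluate
        registry["bertscore"] = evaluate.load("bertscore")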
Error Handling Flow
1. Model inference error detected
↓
2. Primary model marked as failed
↓
3. Fallback model selected from available pipelines
↓
4. Retry logic with exponential backoff for API models
↓
5. LLM fallback for low-confidence QA responses
↓
6. Graceful degradation with user-friendly error messages
↓
7. Comprehensive logging for debugging and monitoring
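The retry-then-fallback behavior can be sketched as below, assuming the primary and fallback models are passed in as callables; the delay constants are arbitrary.

import time

def infer_with_fallback(primary, fallback, question, retries=3):
    """Try the primary model with exponential backoff, then fall back."""
    delay = 1.0
    for attempt in range(retries):
        try:
            return primary(question)
        except Exception:
            if attempt == retries - 1:
                break  # retries exhausted; switch to the fallback model
            time.sleep(delay)
            delay *= 2  # exponential backoff between API retries
    return fallback(question)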
Scalability Architecture
Horizontal Scaling
- Stateless Design: No server-side session storage
- Container Ready: Docker-compatible architecture
- Load Balancing: Multiple application instances behind load balancer
Performance Scaling
- Model Caching: Preloaded models reduce cold-start latency
- Async Processing: Non-blocking metrics computation
- Resource Pooling: Shared model instances across requests
Data Scaling
- Efficient Storage: Minimal data retention for evaluation results
- Caching Strategy: In-memory caching for frequently used models
- Cleanup Policies: Automatic cache invalidation and memory management
Security Architecture
API Security
- Key Management: Secure storage of API keys in environment variables
- Request Validation: Input sanitization and size limits
- Rate Limiting: Protection against abuse and API quota exhaustion
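For illustration, a naive in-process sliding-window limiter is sketched below; a production deployment would more likely back this with a shared store such as Redis.

import time
from collections import defaultdict

class RateLimiter:
    """Naive per-client sliding-window limiter; in-memory, single process."""

    def __init__(self, max_requests=30, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(list)

    def allow(self, client_id):
        now = time.monotonic()
        recent = [t for t in self.history[client_id] if now - t < self.window]
        self.history[client_id] = recent
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True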
Data Protection
- PII Handling: No storage of sensitive user information
- Secure Headers: HTTPS enforcement and security headers
- Audit Logging: Comprehensive request logging for security monitoring
Provider Security
- Credential Isolation: Separate API keys for different providers
- Request Signing: Secure authentication with model providers
- Response Validation: Sanitization of model responses before display
Deployment Architecture
Development Environment
Local Development Setup:
├── Python virtual environment
├── Local model cache (./hf_cache)
├── Environment variable configuration
└── Hot-reload development server
Production Environment
Production Deployment:
├── Docker containerization
├── Multi-stage model loading
├── Health check integration
├── Structured logging (JSON format)
├── Metrics collection (Prometheus)
└── Distributed tracing (OpenTelemetry)
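The health check integration could be as small as an endpoint that reports whether the critical components loaded; PIPELINES and METRICS_ENGINE below are hypothetical module-level handles, not the project's actual names, and app is the Flask application from the request-flow sketch.

from flask import jsonify

@app.route("/health")
def health():
    # PIPELINES and METRICS_ENGINE are hypothetical handles to the loaded
    # model registry and the metrics engine.
    checks = {
        "models_loaded": bool(PIPELINES),
        "metrics_ready": METRICS_ENGINE is not None,
    }
    status = 200 if all(checks.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", **checks}
    return jsonify(body), status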
Configuration Management
# Production Configuration Example
deployment:
  replicas: 3
  resources:
    memory: "8Gi"
    cpu: "2000m"
models:
  preload_heavy_metrics: false
  cache_dir: "/app/hf_cache"
  max_model_memory: "6Gi"
monitoring:
  health_check_interval: "30s"
  metrics_retention: "7d"
  log_level: "INFO"
This modular, layered architecture provides a robust foundation for chatbot evaluation while maintaining flexibility for future enhancements and scaling requirements.