Extractive QA Models
Chatbot Evaluation System - Extractive QA Model Architecture
Overview
Extractive Question Answering (QA) models represent the first generation of neural QA systems that extract answers directly from provided context text rather than generating new responses. These models revolutionized how machines understand and answer questions by treating QA as a reading comprehension task.
Historical Background
Evolution of QA Systems
- Rule-Based Systems (1960s-1990s): Early QA systems used pattern matching and knowledge bases
- Information Retrieval (1990s-2000s): Systems retrieved relevant documents and extracted answers
- Neural Networks (2010s): Deep learning approaches for end-to-end QA
- Transformer Revolution (2017+): Attention-based models dominate QA tasks
SQuAD Dataset Impact
The Stanford Question Answering Dataset (SQuAD), published in 2016, became the benchmark that drove extractive QA research:
- SQuAD 1.0: 100,000+ question-answer pairs drawn from Wikipedia articles
- SQuAD 2.0: Added unanswerable questions to test model robustness
- Impact: Standardized evaluation and enabled direct model comparison
Model Architecture
DistilBERT QA Architecture
Model Specifications:
- Base Model: DistilBERT-base-uncased (distilled from BERT-base)
- Parameters: 66 million (40% fewer than BERT-base)
- Layers: 6 encoder layers (vs. 12 in BERT-base)
- Hidden Size: 768 dimensions
- Attention Heads: 12
Architecture Diagram:
Input Question + Context
↓
Token Embedding Layer
↓
Position Embedding Layer
↓
[Encoder Layer 1] → [Encoder Layer 2] → ... → [Encoder Layer 6]
↓
Start Position Classifier End Position Classifier
↓ ↓
Start Probabilities End Probabilities
↓ ↓
Answer Span Extraction ←────── Argmax ───────→ Confidence Score
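These specifications can be checked directly against a published checkpoint's configuration. A minimal sketch, assuming the public distilbert-base-uncased-distilled-squad checkpoint (not necessarily the exact checkpoint this system loads):

from transformers import AutoConfig, AutoModelForQuestionAnswering

# Illustrative SQuAD-fine-tuned DistilBERT checkpoint from the Hugging Face Hub
checkpoint = "distilbert-base-uncased-distilled-squad"

config = AutoConfig.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

print("Encoder layers:", config.n_layers)             # 6
print("Hidden size:", config.dim)                      # 768
print("Attention heads:", config.n_heads)              # 12
print("Parameters:", f"{model.num_parameters():,}")    # roughly 66M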
RoBERTa QA Architecture
Model Specifications:
- Base Model: RoBERTa-base (optimized BERT training procedure)
- Parameters: 125 million
- Layers: 12 encoder layers
- Hidden Size: 768 dimensions
- Attention Heads: 12
Key Improvements over BERT:
1. Dynamic Masking: Masking patterns are regenerated during training rather than fixed at preprocessing time
2. Full-Sentences: Packs full sentences into each input and removes the next sentence prediction objective
3. Larger Batch Sizes: Trained with larger mini-batches
4. Byte-Level BPE: More efficient tokenization with a byte-level vocabulary
ALBERT QA Architecture
Model Specifications:
- Base Model: ALBERT-base-v2 (lite BERT with parameter sharing)
- Parameters: 11 million (~90% fewer than BERT-base)
- Layers: 12 encoder layers with cross-layer parameter sharing
- Embedding Size: 128 dimensions (factorized); Hidden Size: 768 dimensions
- Attention Heads: 12
Key Innovations:
1. Factorized Embedding Parameterization: Decouples the vocabulary embedding size from the hidden size (see the arithmetic sketch below)
2. Cross-Layer Parameter Sharing: Reuses the same weights across encoder layers, reducing parameters while maintaining performance
3. Sentence Order Prediction: Replaces next sentence prediction and improves inter-sentence coherence understanding
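A back-of-the-envelope calculation shows why the factorization matters. Assuming the vocabulary size (30,000), hidden size (768), and embedding size (128) reported in the ALBERT paper, the embedding table alone shrinks by roughly a factor of six:

# Embedding parameter count: BERT-style (V x H) vs. ALBERT factorized (V x E + E x H)
vocab_size = 30_000      # ALBERT vocabulary size (paper value, not this system's config)
hidden_size = 768
embedding_size = 128

bert_style_params = vocab_size * hidden_size
albert_style_params = vocab_size * embedding_size + embedding_size * hidden_size

print(f"BERT-style embedding parameters:        {bert_style_params:,}")    # 23,040,000
print(f"ALBERT factorized embedding parameters: {albert_style_params:,}")  # 3,938,304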
Technical Implementation Details
Input Processing Pipeline
Tokenization Process:
# Example tokenization for QA input (requires the transformers library)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

# Combined input format: [CLS] question [SEP] context [SEP]
tokenizer_output = tokenizer(question, context)

# Illustrative contents of tokenizer_output:
# 'input_ids':      [101, 2054, 2024, 1996, 8030, 1997, 14671, 102, 14671, 8030, ...]
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...]  # 0 for question tokens, 1 for context tokens
Model Forward Pass
Mathematical Formulation:
# For each token position i, the encoder output is projected to two scalars:
#   start_logits[i] = W_start · encoder_output[i] + b_start
#   end_logits[i]   = W_end   · encoder_output[i] + b_end

# During inference:
start_position = argmax(start_logits)
end_position = argmax(end_logits)

# Ensure a valid span (start <= end) of reasonable length, then take the token slice
if start_position <= end_position and end_position - start_position < max_answer_length:
    answer_tokens = tokens[start_position:end_position + 1]
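The same logic in runnable form, using the Hugging Face QA head directly (checkpoint and inputs are illustrative; the production path goes through the pipeline shown later):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "distilbert-base-uncased-distilled-squad"  # illustrative public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start and end token positions
start_position = outputs.start_logits.argmax(dim=-1).item()
end_position = outputs.end_logits.argmax(dim=-1).item()

# Decode the predicted token span back to text
answer_ids = inputs["input_ids"][0, start_position:end_position + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))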
Training Objective
Loss Function:
# Cross-entropy loss over start and end positions, averaged
loss = (cross_entropy_loss(start_logits, start_positions) +
        cross_entropy_loss(end_logits, end_positions)) / 2

# For unanswerable questions (SQuAD 2.0), the gold span points at the [CLS] token
if is_impossible:
    start_position = end_position = 0  # index 0 is the [CLS] token
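Written out with PyTorch primitives, the objective is simply the mean of two cross-entropy terms; the sketch below uses made-up tensor shapes purely to illustrate the computation:

import torch
import torch.nn.functional as F

batch_size, seq_len = 8, 384                                 # illustrative shapes
start_logits = torch.randn(batch_size, seq_len)              # model outputs (random stand-ins)
end_logits = torch.randn(batch_size, seq_len)
start_positions = torch.randint(0, seq_len, (batch_size,))   # gold start indices
end_positions = torch.randint(0, seq_len, (batch_size,))     # gold end indices

# Average of the two cross-entropy terms, as in BERT-style QA heads
loss = (F.cross_entropy(start_logits, start_positions) +
        F.cross_entropy(end_logits, end_positions)) / 2
print(loss.item())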
Performance Characteristics
Accuracy Metrics
SQuAD 1.1 Leaderboard Performance:

| Model | EM Score | F1 Score | Parameters | Training Data |
|-------|----------|----------|------------|---------------|
| DistilBERT | 79.1 | 86.9 | 66M | SQuAD |
| RoBERTa-base | 88.3 | 94.6 | 125M | SQuAD + 10x more |
| ALBERT-base-v2 | 89.3 | 94.8 | 11M | SQuAD + 10x more |
Computational Efficiency
Inference Speed (tokens/second on a V100 GPU):
- DistilBERT: ~2,800 tokens/sec (fastest)
- RoBERTa-base: ~1,900 tokens/sec (balanced)
- ALBERT-base-v2: ~1,600 tokens/sec (slowest of the three; parameter sharing saves memory, not compute)

Memory Usage (batch size 1):
- DistilBERT: ~480 MB GPU memory
- RoBERTa-base: ~650 MB GPU memory
- ALBERT-base-v2: ~320 MB GPU memory (most memory-efficient)
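Throughput and memory figures like these depend heavily on hardware, batch size, and sequence length, so it is worth re-measuring on the target machine. A minimal latency check, assuming an illustrative checkpoint and toy inputs:

import time
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

# Warm-up run so model loading is not included in the timing
qa(question=question, context=context)

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    qa(question=question, context=context)
elapsed = time.perf_counter() - start
print(f"Average latency: {elapsed / n_runs * 1000:.1f} ms per question")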
Use Cases and Applications
Healthcare Applications
Medical Question Answering:
# Example medical QA, run through a Hugging Face question-answering pipeline
from transformers import pipeline

# Illustrative checkpoint; the deployed system preloads several QA models (see below)
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

question = "What are the common side effects of metformin?"
context = """
Metformin is used to treat type 2 diabetes. Common side effects include:
- Nausea and vomiting
- Diarrhea
- Stomach pain
- Loss of appetite
- Metallic taste in mouth
Rare but serious side effects include lactic acidosis.
"""

result = qa(question=question, context=context)
# The model extracts a contiguous span from the context, e.g. the bulleted
# list of side effects, together with a confidence score
print(result["answer"], result["score"])
Symptom Analysis:
- Input: Patient describes symptoms
- Context: Medical knowledge base or patient history
- Output: Relevant symptom information and recommendations
Advantages in Healthcare Context
- Factual Accuracy: Extracts answers directly from verified medical sources
- Source Attribution: Can point to specific parts of source documents (see the offset example after this list)
- Controlled Responses: Cannot produce text that does not appear in the provided context
- Explainability: Answer spans are traceable to source material
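The offset example referenced above: the question-answering pipeline returns character offsets into the context alongside the answer text, which is what makes source attribution straightforward (checkpoint and inputs are illustrative):

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = "Metformin is used to treat type 2 diabetes. Common side effects include nausea and diarrhea."
result = qa(question="What is metformin used for?", context=context)

# 'start' and 'end' are character offsets into the context string,
# so the extracted span can be highlighted in the source document
print(result["answer"])
print(context[result["start"]:result["end"]])  # identical to result["answer"]
print(f"Confidence: {result['score']:.2f}")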
Integration with Our System
Model Loading Strategy
Preloading Process (models/qa.py):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Cache of ready-to-use QA pipelines, keyed by model name
qa_pipelines = {}

def preload_qa_models():
    for model_name, config in SUPPORTED_MODELS.items():
        if config["provider"] == "hf_qa":
            # Load tokenizer and model weights from the Hugging Face Hub
            tokenizer = AutoTokenizer.from_pretrained(config["model"])
            model = AutoModelForQuestionAnswering.from_pretrained(config["model"])
            # Wrap both in a question-answering pipeline and cache it
            qa_pipelines[model_name] = pipeline(
                "question-answering",
                model=model,
                tokenizer=tokenizer,
            )
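At application startup the preload runs once, after which request handlers can look up a cached pipeline by name. A hypothetical usage sketch (the helper name and call pattern are illustrative, not taken from the codebase):

# Run once at startup so the first user request does not pay the model-loading cost
preload_qa_models()

def answer_question(model_name: str, question: str, context: str) -> dict:
    # Hypothetical helper: look up the preloaded pipeline and run inference
    qa = qa_pipelines[model_name]
    return qa(question=question, context=context)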
Inference Pipeline Integration
Context Augmentation:
# Augment context with symptom-specific knowledge base
augmented_context = GENERIC_CONTEXT
for symptom, info in SYMPTOM_KB.items():
if symptom in question.lower():
augmented_context = f"{GENERIC_CONTEXT}\n\n{info}"
break
# Run QA inference
result = qa_pipeline(question=question, context=augmented_context)
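The pipeline returns a dictionary containing the extracted span and a confidence score. A minimal sketch of how that result might be post-processed, assuming an illustrative confidence threshold and fallback message (neither is the system's actual configuration):

# result is a dict like {"answer": ..., "score": ..., "start": ..., "end": ...}
CONFIDENCE_THRESHOLD = 0.30  # illustrative value, not the system's configured threshold

if result["score"] >= CONFIDENCE_THRESHOLD:
    answer = _sanitize_answer(result["answer"])
else:
    # Low confidence: fall back to a safe, generic response (illustrative wording)
    answer = "I'm not sure. Please consult a healthcare professional."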
Post-Processing and Validation
Answer Sanitization:
def _sanitize_answer(text: str) -> str:
    # Remove duplicate lines
    # Extract the first complete sentence
    # Ensure the response is concise and readable
    # Validate against medical safety guidelines
    ...
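A minimal sketch of what such a sanitizer could look like, covering the duplicate-line and first-sentence steps; the function name is hypothetical, the real implementation in models/qa.py may differ, and the safety-validation step is only stubbed:

import re

def _sanitize_answer_sketch(text: str) -> str:
    # Drop empty lines and remove duplicates while preserving order
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    deduped = list(dict.fromkeys(lines))
    cleaned = " ".join(deduped)

    # Keep only the first complete sentence so the reply stays concise
    match = re.match(r"[^.!?]*[.!?]", cleaned)
    result = match.group(0).strip() if match else cleaned

    # Placeholder: medical-safety validation would run here
    return result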
Future Enhancements
Advanced Extractive Models
- Long-range Context: Models that handle longer documents (16K+ tokens)
- Multi-hop Reasoning: Models that can combine information from multiple sources
- Table Understanding: Extractive QA from structured data (tables, forms)
- Multi-modal: Extractive QA from text + images (medical scans, charts)
Domain Adaptation
- Medical Pre-training: Models pre-trained on medical literature
- Symptom-specific Fine-tuning: Specialized models for healthcare scenarios
- Multi-lingual Support: Extractive QA in multiple languages
- Adversarial Training: Robustness against out-of-domain questions
This comprehensive extractive QA architecture provides a solid foundation for accurate, source-attributable question answering in healthcare contexts, with room for continued advancement in model capabilities and domain specialization.