Extractive QA Models
Chatbot Evaluation System - Extractive QA Model Architecture
Overview
Extractive Question Answering (QA) models represent the first generation of neural QA systems that extract answers directly from provided context text rather than generating new responses. These models revolutionized how machines understand and answer questions by treating QA as a reading comprehension task.
Historical Background
Evolution of QA Systems
- Rule-Based Systems (1960s-1990s): Early QA systems used pattern matching and knowledge bases
- Information Retrieval (1990s-2000s): Systems retrieved relevant documents and extracted answers
- Neural Networks (2010s): Deep learning approaches for end-to-end QA
- Transformer Revolution (2017+): Attention-based models dominate QA tasks
SQuAD Dataset Impact
The Stanford Question Answering Dataset (SQuAD), published in 2016, became the benchmark that drove extractive QA research:
- SQuAD 1.0: 100,000+ question-answer pairs drawn from Wikipedia articles
- SQuAD 2.0: Added unanswerable questions to test model robustness
- Impact: Standardized evaluation and enabled direct model comparison
Model Architecture
DistilBERT QA Architecture
Model Specifications:
- Base Model: DistilBERT-base-uncased (distilled from BERT-base)
- Parameters: 66 million (40% fewer than BERT-base)
- Layers: 6 encoder layers (vs. 12 in BERT-base)
- Hidden Size: 768 dimensions
- Attention Heads: 12
Architecture Diagram:
Input Question + Context
↓
Token Embedding Layer
↓
Position Embedding Layer
↓
[Encoder Layer 1] → [Encoder Layer 2] → ... → [Encoder Layer 6]
↓
Start Position Classifier End Position Classifier
↓ ↓
Start Probabilities End Probabilities
↓ ↓
Answer Span Extraction ←────── Argmax ───────→ Confidence Score
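These specifications can be checked directly against a published checkpoint's configuration. A minimal sketch, assuming the public distilbert-base-uncased-distilled-squad checkpoint (not necessarily the exact checkpoint this system loads):

from transformers import AutoConfig, AutoModelForQuestionAnswering

# Illustrative SQuAD-fine-tuned DistilBERT checkpoint from the Hugging Face Hub
checkpoint = "distilbert-base-uncased-distilled-squad"

config = AutoConfig.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

print("Encoder layers:", config.n_layers)             # 6
print("Hidden size:", config.dim)                      # 768
print("Attention heads:", config.n_heads)              # 12
print("Parameters:", f"{model.num_parameters():,}")    # roughly 66M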
RoBERTa QA Architecture
Model Specifications:
- Base Model: RoBERTa-base (optimized BERT training procedure)
- Parameters: 125 million
- Layers: 12 encoder layers
- Hidden Size: 768 dimensions
- Attention Heads: 12
Key Improvements over BERT:
1. Dynamic Masking: Masking patterns are regenerated during training rather than fixed at preprocessing time
2. Full-Sentences: Packs full sentences into each input and removes the next sentence prediction objective
3. Larger Batch Sizes: Trained with larger mini-batches
4. Byte-Level BPE: More efficient tokenization with a byte-level vocabulary
ALBERT QA Architecture
Model Specifications:
- Base Model: ALBERT-base-v2 (lite BERT with parameter sharing)
- Parameters: 11 million (~90% fewer than BERT-base)
- Layers: 12 encoder layers with cross-layer parameter sharing
- Embedding Size: 128 dimensions (factorized); Hidden Size: 768 dimensions
- Attention Heads: 12
Key Innovations:
1. Factorized Embedding Parameterization: Decouples the vocabulary embedding size from the hidden size (see the arithmetic sketch below)
2. Cross-Layer Parameter Sharing: Reuses the same weights across encoder layers, reducing parameters while maintaining performance
3. Sentence Order Prediction: Replaces next sentence prediction and improves inter-sentence coherence understanding
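A back-of-the-envelope calculation shows why the factorization matters. Assuming the vocabulary size (30,000), hidden size (768), and embedding size (128) reported in the ALBERT paper, the embedding table alone shrinks by roughly a factor of six:

# Embedding parameter count: BERT-style (V x H) vs. ALBERT factorized (V x E + E x H)
vocab_size = 30_000      # ALBERT vocabulary size (paper value, not this system's config)
hidden_size = 768
embedding_size = 128

bert_style_params = vocab_size * hidden_size
albert_style_params = vocab_size * embedding_size + embedding_size * hidden_size

print(f"BERT-style embedding parameters:        {bert_style_params:,}")    # 23,040,000
print(f"ALBERT factorized embedding parameters: {albert_style_params:,}")  # 3,938,304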
Technical Implementation Details
Input Processing Pipeline
Tokenization Process:
# Example tokenization for QA input (requires the transformers library)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

# Combined input format: [CLS] question [SEP] context [SEP]
tokenizer_output = tokenizer(question, context)

# Illustrative contents of tokenizer_output:
# 'input_ids':      [101, 2054, 2024, 1996, 8030, 1997, 14671, 102, 14671, 8030, ...]
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...]  # 0 for question tokens, 1 for context tokens
Model Forward Pass
Mathematical Formulation:
# For each token position i, the encoder output is projected to two scalars:
#   start_logits[i] = W_start · encoder_output[i] + b_start
#   end_logits[i]   = W_end   · encoder_output[i] + b_end

# During inference:
start_position = argmax(start_logits)
end_position = argmax(end_logits)

# Ensure a valid span (start <= end) of reasonable length, then take the token slice
if start_position <= end_position and end_position - start_position < max_answer_length:
    answer_tokens = tokens[start_position:end_position + 1]
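The same logic in runnable form, using the Hugging Face QA head directly (checkpoint and inputs are illustrative; the production path goes through the pipeline shown later):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "distilbert-base-uncased-distilled-squad"  # illustrative public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start and end token positions
start_position = outputs.start_logits.argmax(dim=-1).item()
end_position = outputs.end_logits.argmax(dim=-1).item()

# Decode the predicted token span back to text
answer_ids = inputs["input_ids"][0, start_position:end_position + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))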
Training Objective
Loss Function:
# Cross-entropy loss over start and end positions, averaged
loss = (cross_entropy_loss(start_logits, start_positions) +
        cross_entropy_loss(end_logits, end_positions)) / 2

# For unanswerable questions (SQuAD 2.0), the gold span points at the [CLS] token
if is_impossible:
    start_position = end_position = 0  # index 0 is the [CLS] token
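Written out with PyTorch primitives, the objective is simply the mean of two cross-entropy terms; the sketch below uses made-up tensor shapes purely to illustrate the computation:

import torch
import torch.nn.functional as F

batch_size, seq_len = 8, 384                                 # illustrative shapes
start_logits = torch.randn(batch_size, seq_len)              # model outputs (random stand-ins)
end_logits = torch.randn(batch_size, seq_len)
start_positions = torch.randint(0, seq_len, (batch_size,))   # gold start indices
end_positions = torch.randint(0, seq_len, (batch_size,))     # gold end indices

# Average of the two cross-entropy terms, as in BERT-style QA heads
loss = (F.cross_entropy(start_logits, start_positions) +
        F.cross_entropy(end_logits, end_positions)) / 2
print(loss.item())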
Performance Characteristics
Accuracy Metrics
SQuAD 1.1 Leaderboard Performance:

| Model | EM Score | F1 Score | Parameters | Training Data |
|-------|----------|----------|------------|---------------|
| DistilBERT | 79.1 | 86.9 | 66M | SQuAD |
| RoBERTa-base | 88.3 | 94.6 | 125M | SQuAD + 10x more |
| ALBERT-base-v2 | 89.3 | 94.8 | 11M | SQuAD + 10x more |
Computational Efficiency
Inference Speed (tokens/second on a V100 GPU):
- DistilBERT: ~2,800 tokens/sec (fastest)
- RoBERTa-base: ~1,900 tokens/sec (balanced)
- ALBERT-base-v2: ~1,600 tokens/sec (slowest of the three; parameter sharing saves memory, not compute)

Memory Usage (batch size 1):
- DistilBERT: ~480 MB GPU memory
- RoBERTa-base: ~650 MB GPU memory
- ALBERT-base-v2: ~320 MB GPU memory (most memory-efficient)
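Throughput and memory figures like these depend heavily on hardware, batch size, and sequence length, so it is worth re-measuring on the target machine. A minimal latency check, assuming an illustrative checkpoint and toy inputs:

import time
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

question = "What are the symptoms of diabetes?"
context = "Diabetes symptoms include frequent urination, excessive thirst, and fatigue."

# Warm-up run so model loading is not included in the timing
qa(question=question, context=context)

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    qa(question=question, context=context)
elapsed = time.perf_counter() - start
print(f"Average latency: {elapsed / n_runs * 1000:.1f} ms per question")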
Use Cases and Applications
Healthcare Applications
Medical Question Answering:
# Example medical QA, run through a Hugging Face question-answering pipeline
from transformers import pipeline

# Illustrative checkpoint; the deployed system preloads several QA models (see below)
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

question = "What are the common side effects of metformin?"
context = """
Metformin is used to treat type 2 diabetes. Common side effects include:
- Nausea and vomiting
- Diarrhea
- Stomach pain
- Loss of appetite
- Metallic taste in mouth
Rare but serious side effects include lactic acidosis.
"""

result = qa(question=question, context=context)
# The model extracts a contiguous span from the context, e.g. the bulleted
# list of side effects, together with a confidence score
print(result["answer"], result["score"])
Symptom Analysis:
- Input: Patient describes symptoms
- Context: Medical knowledge base or patient history
- Output: Relevant symptom information and recommendations
Advantages in Healthcare Context
- Factual Accuracy: Extracts answers directly from verified medical sources
- Source Attribution: Can point to specific parts of source documents (see the offset example after this list)
- Controlled Responses: Cannot produce text that does not appear in the provided context
- Explainability: Answer spans are traceable to source material
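The offset example referenced above: the question-answering pipeline returns character offsets into the context alongside the answer text, which is what makes source attribution straightforward (checkpoint and inputs are illustrative):

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = "Metformin is used to treat type 2 diabetes. Common side effects include nausea and diarrhea."
result = qa(question="What is metformin used for?", context=context)

# 'start' and 'end' are character offsets into the context string,
# so the extracted span can be highlighted in the source document
print(result["answer"])
print(context[result["start"]:result["end"]])  # identical to result["answer"]
print(f"Confidence: {result['score']:.2f}")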
Integration with Our System
Model Loading Strategy
Preloading Process (models/qa.py):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Cache of ready-to-use QA pipelines, keyed by model name
qa_pipelines = {}

def preload_qa_models():
    for model_name, config in SUPPORTED_MODELS.items():
        if config["provider"] == "hf_qa":
            # Load tokenizer and model weights from the Hugging Face Hub
            tokenizer = AutoTokenizer.from_pretrained(config["model"])
            model = AutoModelForQuestionAnswering.from_pretrained(config["model"])
            # Wrap both in a question-answering pipeline and cache it
            qa_pipelines[model_name] = pipeline(
                "question-answering",
                model=model,
                tokenizer=tokenizer,
            )
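At application startup the preload runs once, after which request handlers can look up a cached pipeline by name. A hypothetical usage sketch (the helper name and call pattern are illustrative, not taken from the codebase):

# Run once at startup so the first user request does not pay the model-loading cost
preload_qa_models()

def answer_question(model_name: str, question: str, context: str) -> dict:
    # Hypothetical helper: look up the preloaded pipeline and run inference
    qa = qa_pipelines[model_name]
    return qa(question=question, context=context)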
Inference Pipeline Integration
Context Augmentation:
# Augment context with symptom-specific knowledge base
augmented_context = GENERIC_CONTEXT
for symptom, info in SYMPTOM_KB.items():
if symptom in question.lower():
augmented_context = f"{GENERIC_CONTEXT}\n\n{info}"
break
# Run QA inference
result = qa_pipeline(question=question, context=augmented_context)
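The pipeline returns a dictionary containing the extracted span and a confidence score. A minimal sketch of how that result might be post-processed, assuming an illustrative confidence threshold and fallback message (neither is the system's actual configuration):

# result is a dict like {"answer": ..., "score": ..., "start": ..., "end": ...}
CONFIDENCE_THRESHOLD = 0.30  # illustrative value, not the system's configured threshold

if result["score"] >= CONFIDENCE_THRESHOLD:
    answer = _sanitize_answer(result["answer"])
else:
    # Low confidence: fall back to a safe, generic response (illustrative wording)
    answer = "I'm not sure. Please consult a healthcare professional."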
Post-Processing and Validation
Answer Sanitization:
def _sanitize_answer(text: str) -> str:
    # Remove duplicate lines
    # Extract the first complete sentence
    # Ensure the response is concise and readable
    # Validate against medical safety guidelines
    ...
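A minimal sketch of what such a sanitizer could look like, covering the duplicate-line and first-sentence steps; the function name is hypothetical, the real implementation in models/qa.py may differ, and the safety-validation step is only stubbed:

import re

def _sanitize_answer_sketch(text: str) -> str:
    # Drop empty lines and remove duplicates while preserving order
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    deduped = list(dict.fromkeys(lines))
    cleaned = " ".join(deduped)

    # Keep only the first complete sentence so the reply stays concise
    match = re.match(r"[^.!?]*[.!?]", cleaned)
    result = match.group(0).strip() if match else cleaned

    # Placeholder: medical-safety validation would run here
    return result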
Future Enhancements
Advanced Extractive Models
- Long-range Context: Models that handle longer documents (16K+ tokens)
- Multi-hop Reasoning: Models that can combine information from multiple sources
- Table Understanding: Extractive QA from structured data (tables, forms)
- Multi-modal: Extractive QA from text + images (medical scans, charts)
Domain Adaptation
- Medical Pre-training: Models pre-trained on medical literature
- Symptom-specific Fine-tuning: Specialized models for healthcare scenarios
- Multi-lingual Support: Extractive QA in multiple languages
- Adversarial Training: Robustness against out-of-domain questions
This comprehensive extractive QA architecture provides a solid foundation for accurate, source-attributable question answering in healthcare contexts, with room for continued advancement in model capabilities and domain specialization.