2026 · Coursework + extension · 2 min read
Document QA Assistant
Extractive document question-answering pipeline using PDF text extraction, sentence-based chunking, RoBERTa-SQuAD2 inference, and answer evaluation scripts.
- Python
- Transformers
- RoBERTa
- PDF processing
- NLP
What it does
Answers natural-language questions against PDF documents using an extractive QA approach: pull text from the PDF, chunk it into sentence-level windows, run RoBERTa-SQuAD2 inference per chunk, and pick the highest-confidence span as the answer.
Pipeline
- PDF text extraction — convert source PDFs into clean per-page text.
- Chunking — split into sentence-based windows with overlap so an answer that straddles a boundary still scores.
- Extractive QA — RoBERTa-SQuAD2 inference per chunk; each chunk returns its best-confidence span and score.
- Aggregation — pick the highest-confidence span across all chunks.
- Evaluation — exact-match and F1 against a held-out question set, plus error analysis on the failures.
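The chunking and aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regex sentence splitter, the `window` and `overlap` defaults, and the names `chunk_sentences`, `best_span`, and `toy_qa` are all assumptions made for the example. `toy_qa` is a word-overlap stand-in so the sketch runs without downloading a model; it is not RoBERTa.

```python
import re
from typing import Callable, List, NamedTuple, Optional, Tuple


class Span(NamedTuple):
    text: str
    score: float
    chunk_index: int


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: List[str], window: int = 3, overlap: int = 1) -> List[str]:
    # Slide a window of `window` sentences, sharing `overlap` sentences
    # between neighbours so a boundary-straddling answer lands whole in one chunk.
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


def best_span(
    question: str,
    chunks: List[str],
    qa_fn: Callable[[str, str], Tuple[str, float]],
) -> Optional[Span]:
    # Ask the QA function once per chunk and keep the highest-scoring answer.
    best = None
    for i, chunk in enumerate(chunks):
        answer, score = qa_fn(question, chunk)
        if answer and (best is None or score > best.score):
            best = Span(answer, score, i)
    return best


def toy_qa(question: str, chunk: str) -> Tuple[str, float]:
    # Stand-in scorer (hypothetical): fraction of question words found in the chunk.
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", chunk.lower()))
    if not q_words:
        return "", 0.0
    return chunk, len(q_words & c_words) / len(q_words)
```

In a real run, `qa_fn` would wrap the model instead of `toy_qa`. With Hugging Face Transformers, for example, a `question-answering` pipeline returns a dict with `answer` and `score` keys, so a thin wrapper returning `(result["answer"], result["score"])` would fit this interface. Which checkpoint to load is left open here.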
What this is and isn't
This is extractive QA, not full RAG: there's no embedding index, no retriever, no generator. The model returns spans copied directly from the source text. It's a clean, deterministic baseline for the harder generative-QA work the LLM event-extraction project moves toward.
Lessons
- Chunk boundaries matter more than chunk size. Overlap helps.
- Aggregating by raw confidence score works well as a first pass; calibrating per-chunk scores against chunk length then removed a class of false positives.
- An evaluation harness you actually run is worth ten you talk about.
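For the evaluation step, exact-match and F1 can be computed along these lines. This is a sketch that follows the usual SQuAD-style normalization (lowercase, strip punctuation, drop English articles, collapse whitespace); whether the project's scripts normalize the same way is an assumption, and the function names and the `evaluate` helper are illustrative.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable, Tuple


def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Two empty answers agree (unanswerable question); otherwise it is a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    # `pairs` holds (prediction, reference) strings; returns mean EM and F1.
    pairs = list(pairs)
    if not pairs:
        return {"exact_match": 0.0, "f1": 0.0}
    em = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
    f1 = sum(f1_score(p, r) for p, r in pairs) / len(pairs)
    return {"exact_match": em, "f1": f1}
```

Token-level F1 gives partial credit when the predicted span is longer or shorter than the reference, which is why it is usually reported next to exact match for extractive QA.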