2026 · Coursework + extension · 2 min read

Document QA Assistant

Extractive document question-answering pipeline using PDF text extraction, sentence-based chunking, RoBERTa-SQuAD2 inference, and answer evaluation scripts.

  • Python
  • Transformers
  • RoBERTa
  • PDF processing
  • NLP

What it does

Answers natural-language questions against PDF documents using an extractive QA approach: pull text from the PDF, chunk it into sentence-level windows, run RoBERTa-SQuAD2 inference per chunk, and pick the highest-confidence span as the answer.
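
A minimal sketch of the chunk-then-pick-best flow described above, not the project's actual code. All names here (`sentence_windows`, `best_answer`, `qa_fn`) are hypothetical, the regex splitter is a stand-in for a proper sentence tokenizer, and the window size and overlap defaults are illustrative guesses. The QA model is passed in as a plain callable so the sketch runs without downloading anything.

```python
import re
from typing import Callable, Dict, List, Optional

# Naive sentence boundary: whitespace that follows ., ! or ?.
# Assumption: a real pipeline would likely use a proper sentence tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def sentence_windows(text: str, size: int = 5, overlap: int = 2) -> List[str]:
    """Group sentences into windows of `size`, sharing `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    sentences = split_sentences(text)
    step = size - overlap
    windows: List[str] = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reaches the end of the text
    return windows


def best_answer(
    question: str,
    chunks: List[str],
    qa_fn: Callable[[str, str], Dict],
) -> Optional[Dict]:
    """Run qa_fn on every chunk and keep the highest-scoring span.

    qa_fn(question, context) is assumed to return a dict with at least
    "answer" (str) and "score" (float).
    """
    best: Optional[Dict] = None
    for index, chunk in enumerate(chunks):
        result = qa_fn(question, chunk)
        if best is None or result["score"] > best["score"]:
            best = {"answer": result["answer"], "score": result["score"], "chunk": index}
    return best
```

With Hugging Face Transformers, a question-answering `pipeline` returns a dict containing `score` and `answer`, so a thin wrapper such as `lambda q, c: qa(question=q, context=c)` could serve as `qa_fn`. Which checkpoint to load is an assumption; the page names only "RoBERTa-SQuAD2", and `deepset/roberta-base-squad2` is one publicly available candidate.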

Pipeline

  1. PDF text extraction — convert source PDFs into clean per-page text.
  2. Chunking — split into sentence-based windows with overlap so an answer that straddles a boundary still scores.
  3. Extractive QA — RoBERTa-SQuAD2 inference per chunk; each chunk returns its best-confidence span and score.
  4. Aggregation — pick the highest-confidence span across all chunks.
  5. Evaluation — exact-match and F1 against a held-out question set, plus error analysis on the failures.
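
Step 5 can be sketched as follows, assuming the SQuAD-style definitions of exact match and token-level F1 (lowercase, strip punctuation and English articles, collapse whitespace). The page does not say which normalization the project's scripts apply, so treat this as one common convention rather than the project's harness; the function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(truth).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    shared = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average EM and F1 over (prediction, reference) pairs."""
    pairs = list(pairs)
    if not pairs:
        return {"em": 0.0, "f1": 0.0}
    return {
        "em": sum(exact_match(p, t) for p, t in pairs) / len(pairs),
        "f1": sum(f1_score(p, t) for p, t in pairs) / len(pairs),
    }
```

Keeping the per-question scores alongside the averages makes the error analysis easier: the questions with F1 below 1.0 are the ones worth reading by hand.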

What this is and isn't

This is extractive QA, not full RAG: there's no embedding index, no retriever, no generator. The model returns spans copied directly from the source text. It's a clean, deterministic baseline for the harder generative-QA work the LLM event-extraction project moves toward.

Lessons

  • Chunk boundaries matter more than chunk size. Overlap helps.
  • Aggregating by raw confidence score works well; calibrating per-chunk scores against chunk length removed a class of false positives.
  • An evaluation harness you actually run is worth ten you talk about.
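
The page does not describe how scores were calibrated against chunk length, so the following is only one possible reading: fit a straight-line trend of score versus chunk length on a development set, then rank chunks by how far each score sits above that trend. Every name here is hypothetical, and the linear fit is an assumption chosen for simplicity.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_length_trend(
    lengths: Sequence[float], scores: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares line score = slope * length + intercept."""
    if not lengths or len(lengths) != len(scores):
        raise ValueError("lengths and scores must be non-empty and equal in size")
    mean_len, mean_score = mean(lengths), mean(scores)
    spread = sum((x - mean_len) ** 2 for x in lengths)
    if spread == 0:
        # All chunks have the same length: no trend to remove.
        return 0.0, mean_score
    slope = sum(
        (x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores)
    ) / spread
    return slope, mean_score - slope * mean_len


def calibrated_score(
    score: float, length: float, slope: float, intercept: float
) -> float:
    """Score minus the score expected for a chunk of this length."""
    return score - (slope * length + intercept)
```

Under this sketch, a short chunk whose raw score is high only because short chunks tend to score high ends up with a small calibrated score, while a long chunk that beats the trend for its length moves up the ranking.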