2026 · Solo project · 2 min read
LLM Event Extraction Baseline
Zero-shot event extraction with Qwen2.5-7B-Instruct on MAVEN and WikiEvents. Compares unconstrained vs constrained-label prompting across trigger detection, type prediction, and argument extraction. A100 GPU inference via Hugging Face.
- Python
- Qwen2.5-7B
- Hugging Face
- PyTorch
- A100
Goal
Establish a credible zero-shot baseline for event extraction with a 7B open-weights LLM, so future fine-tuned models have something honest to compare against.
Setup
- Datasets: MAVEN (general-domain Wikipedia text, 168 event types) and WikiEvents (document-level data with richer argument schemas).
- Model: Qwen2.5-7B-Instruct, run on Sheffield Stanage A100 GPUs via Hugging Face Transformers.
- Two prompt families (a prompt-construction sketch follows this list):
- Unconstrained — the model returns whatever JSON it likes, given only the task definition.
- Constrained-label — the prompt enumerates the closed set of allowed event types and the model picks from them.
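A minimal sketch of how the two prompt families can be wired up with Hugging Face Transformers. The instruction wording, JSON schema, and decoding settings below are illustrative stand-ins, not the exact prompts used in the runs.

```python
# Sketch: zero-shot event extraction with Qwen2.5-7B-Instruct.
# The prompt wording and output schema here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

UNCONSTRAINED = (
    "Extract the events in this sentence. Return a JSON list of "
    '{{"trigger": str, "event_type": str}} objects.\n\nSentence: {sentence}'
)
CONSTRAINED = (
    "Extract the events in this sentence. event_type MUST be one of: {labels}. "
    'Return a JSON list of {{"trigger": str, "event_type": str}} objects.'
    "\n\nSentence: {sentence}"
)

def extract(sentence: str, labels: list[str] | None = None) -> str:
    # Constrained-label prompting enumerates the closed label set;
    # unconstrained prompting leaves event_type completely open.
    if labels is None:
        prompt = UNCONSTRAINED.format(sentence=sentence)
    else:
        prompt = CONSTRAINED.format(labels=", ".join(labels), sentence=sentence)
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
```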
What I measured
| Metric | Why it matters |
|---|---|
| Valid JSON rate | Did the model emit machine-parseable output at all? |
| Trigger accuracy | Did it identify the trigger word correctly? |
| Type accuracy | Among triggers it found, did it pick the right type? |
| Combined accuracy | End-to-end correctness on a single sentence. |
| Failure mode taxonomy | Which kinds of errors cluster together? |
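For context, a rough sketch of the per-sentence scoring behind the first four rows. The field names ("trigger", "event_type") and the one-gold-event-per-sentence assumption are simplifications for illustration, not the project's evaluation code.

```python
# Sketch of per-sentence scoring under a simplified one-event-per-sentence gold format.
import json

def score(raw_outputs: list[str], gold: list[dict]) -> dict:
    parsed, trig_hits, type_hits = 0, 0, 0
    for raw, ref in zip(raw_outputs, gold):
        try:
            events = json.loads(raw)  # counts toward the valid JSON rate
        except json.JSONDecodeError:
            continue
        parsed += 1
        preds = events if isinstance(events, list) else [events]
        trig_ok = any(e.get("trigger") == ref["trigger"] for e in preds)
        type_ok = any(
            e.get("trigger") == ref["trigger"]
            and e.get("event_type") == ref["event_type"]
            for e in preds
        )
        trig_hits += trig_ok
        type_hits += type_ok
    n = len(gold)
    return {
        "valid_json_rate": parsed / n,
        "trigger_accuracy": trig_hits / n,
        # "Among triggers it found, did it pick the right type?"
        "type_accuracy": type_hits / max(trig_hits, 1),
        # End-to-end: right trigger AND right type.
        "combined_accuracy": type_hits / n,
    }
```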
Findings (and honest caveats)
Constrained-label prompting roughly doubled type accuracy while only marginally hurting trigger recall. Most failures came from ambiguous near-synonym triggers and from the model producing extra speculative events the source sentence didn't license.
This is a baseline, not state of the art — the point was to make the fine-tuning experiments that come next interpretable.
What I'd do differently
Add a self-consistency vote across multiple samples, and build an out-of-domain test slice to stress-test the constrained-label setup against real-world noisy text.
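A rough sketch of what that self-consistency vote could look like, assuming each sample returns the same JSON schema as above; `sample_once` is a hypothetical stand-in for one sampled run of the extraction prompt.

```python
# Sketch: sample k generations and keep the (trigger, event_type) pair
# that a strict majority of parseable samples agree on.
import json
from collections import Counter

def self_consistent_extract(sentence: str, sample_once, k: int = 5):
    votes = Counter()
    for _ in range(k):
        try:
            events = json.loads(sample_once(sentence))
        except json.JSONDecodeError:
            continue  # unparseable samples simply don't vote
        for e in events if isinstance(events, list) else [events]:
            votes[(e.get("trigger"), e.get("event_type"))] += 1
    if not votes:
        return None
    (trigger, event_type), count = votes.most_common(1)[0]
    # Only accept a prediction that more than half the samples agree on.
    return {"trigger": trigger, "event_type": event_type} if count > k // 2 else None
```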