2026 · Solo project · 2 min read
LLM Event Extraction Baseline
Zero-shot event extraction with Qwen2.5-7B-Instruct on MAVEN and WikiEvents. Compares unconstrained vs constrained-label prompting across trigger detection, type prediction, and argument extraction. A100 GPU inference via Hugging Face.
- Python
- Qwen2.5-7B
- Hugging Face
- PyTorch
- A100
Goal
Establish a credible zero-shot baseline for event extraction with a 7B open-weights LLM, so future fine-tuned models have something honest to compare against.
Setup
- Datasets: MAVEN (general-domain Wikipedia text, 168 event types) and WikiEvents (document-level data with richer argument schemas).
- Model: Qwen2.5-7B-Instruct, run on Sheffield Stanage A100 GPUs via Hugging Face Transformers.
- Two prompt families (a prompt-construction sketch follows this list):
- Unconstrained — the model returns whatever JSON it likes, given only the task definition.
- Constrained-label — the prompt enumerates the closed set of allowed event types and the model picks from them.
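A minimal sketch of how the two prompt families can be wired up with Hugging Face Transformers. The instruction wording, JSON schema, and decoding settings below are illustrative stand-ins, not the exact prompts used in the runs.

```python
# Sketch: zero-shot event extraction with Qwen2.5-7B-Instruct.
# The prompt wording and output schema here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

UNCONSTRAINED = (
    "Extract the events in this sentence. Return a JSON list of "
    '{{"trigger": str, "event_type": str}} objects.\n\nSentence: {sentence}'
)
CONSTRAINED = (
    "Extract the events in this sentence. event_type MUST be one of: {labels}. "
    'Return a JSON list of {{"trigger": str, "event_type": str}} objects.'
    "\n\nSentence: {sentence}"
)

def extract(sentence: str, labels: list[str] | None = None) -> str:
    # Constrained-label prompting enumerates the closed label set;
    # unconstrained prompting leaves event_type completely open.
    if labels is None:
        prompt = UNCONSTRAINED.format(sentence=sentence)
    else:
        prompt = CONSTRAINED.format(labels=", ".join(labels), sentence=sentence)
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
```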
What I measured
| Metric | Why it matters |
|---|---|
| Valid JSON rate | Did the model emit machine-parseable output at all? |
| Trigger accuracy | Did it identify the trigger word correctly? |
| Type accuracy | Among triggers it found, did it pick the right type? |
| Combined accuracy | End-to-end correctness on a single sentence. |
| Failure mode taxonomy | Which kinds of errors cluster together? |
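For context, a rough sketch of the per-sentence scoring behind the first four rows. The field names ("trigger", "event_type") and the one-gold-event-per-sentence assumption are simplifications for illustration, not the project's evaluation code.

```python
# Sketch of per-sentence scoring under a simplified one-event-per-sentence gold format.
import json

def score(raw_outputs: list[str], gold: list[dict]) -> dict:
    parsed, trig_hits, type_hits = 0, 0, 0
    for raw, ref in zip(raw_outputs, gold):
        try:
            events = json.loads(raw)  # counts toward the valid JSON rate
        except json.JSONDecodeError:
            continue
        parsed += 1
        preds = events if isinstance(events, list) else [events]
        trig_ok = any(e.get("trigger") == ref["trigger"] for e in preds)
        type_ok = any(
            e.get("trigger") == ref["trigger"]
            and e.get("event_type") == ref["event_type"]
            for e in preds
        )
        trig_hits += trig_ok
        type_hits += type_ok
    n = len(gold)
    return {
        "valid_json_rate": parsed / n,
        "trigger_accuracy": trig_hits / n,
        # "Among triggers it found, did it pick the right type?"
        "type_accuracy": type_hits / max(trig_hits, 1),
        # End-to-end: right trigger AND right type.
        "combined_accuracy": type_hits / n,
    }
```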
Findings (and honest caveats)
Constrained-label prompting roughly doubled type accuracy while only marginally hurting trigger recall. Most failures came from ambiguous near-synonym triggers and from the model producing extra speculative events the source sentence didn't license.
This is a baseline, not state of the art — the point was to make the fine-tuning experiments that come next interpretable.
What I'd do differently
Add a self-consistency vote across multiple samples, and build an out-of-domain test slice to stress-test the constrained-label setup against real-world noisy text.
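A rough sketch of what that self-consistency vote could look like, assuming each sample returns the same JSON schema as above; `sample_once` is a hypothetical stand-in for one sampled run of the extraction prompt.

```python
# Sketch: sample k generations and keep the (trigger, event_type) pair
# that a strict majority of parseable samples agree on.
import json
from collections import Counter

def self_consistent_extract(sentence: str, sample_once, k: int = 5):
    votes = Counter()
    for _ in range(k):
        try:
            events = json.loads(sample_once(sentence))
        except json.JSONDecodeError:
            continue  # unparseable samples simply don't vote
        for e in events if isinstance(events, list) else [events]:
            votes[(e.get("trigger"), e.get("event_type"))] += 1
    if not votes:
        return None
    (trigger, event_type), count = votes.most_common(1)[0]
    # Only accept a prediction that more than half the samples agree on.
    return {"trigger": trigger, "event_type": event_type} if count > k // 2 else None
```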