2026 · Solo project · 2 min read

LLM Event Extraction Baseline

Zero-shot event extraction with Qwen2.5-7B-Instruct on MAVEN and WikiEvents. Compares unconstrained vs constrained-label prompting across trigger detection, type prediction, and argument extraction. A100 GPU inference via Hugging Face.

  • Python
  • Qwen2.5-7B
  • Hugging Face
  • PyTorch
  • A100

Goal

Establish a credible zero-shot baseline for event extraction with a 7B open-weights LLM, so future fine-tuned models have something honest to compare against.

Setup

  • Datasets: MAVEN (general-domain news, 168 event types) and WikiEvents (more structured argument schemas).
  • Model: Qwen2.5-7B-Instruct, run on Sheffield Stanage A100 GPUs via Hugging Face Transformers.
  • Two prompt families (both sketched after this list):
    • Unconstrained: the model returns whatever JSON it likes, given only a task definition.
    • Constrained-label: the prompt enumerates the closed set of allowed event types, and the model must pick from that set.
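A minimal sketch of the setup with Hugging Face Transformers. The prompt wording, JSON field names, and generation settings below are illustrative stand-ins, not the exact prompts used in the experiments:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype="auto", device_map="auto"  # single A100 is plenty for 7B
)

UNCONSTRAINED = (
    "Extract the event in the sentence below. Return a JSON object with "
    'keys "trigger" and "event_type".\n\nSentence: {sentence}'
)
CONSTRAINED = (
    "Extract the event in the sentence below. The event type MUST be one of: "
    '{labels}. Return a JSON object with keys "trigger" and "event_type".\n\n'
    "Sentence: {sentence}"
)

def extract(sentence: str, labels: list[str] | None = None) -> str:
    """One zero-shot extraction; labels=None gives the unconstrained prompt."""
    if labels is None:
        prompt = UNCONSTRAINED.format(sentence=sentence)
    else:
        prompt = CONSTRAINED.format(sentence=sentence, labels=", ".join(labels))
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
```

The point of the design is that the only difference between the two families is whether the closed label set is spliced into the prompt; everything else is held fixed so the comparison stays fair.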

What I measured

  • Valid JSON rate: did the model emit machine-parseable output at all?
  • Trigger accuracy: did it identify the trigger word correctly?
  • Type accuracy: among the triggers it found, did it pick the right type?
  • Combined accuracy: end-to-end correctness on a single sentence.
  • Failure mode taxonomy: which kinds of errors cluster together?
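To make the first four metrics concrete, here is a minimal scoring sketch. It assumes one gold event per sentence and hypothetical "trigger"/"event_type" field names; real MAVEN and WikiEvents sentences can carry multiple events, which needs a matching step on top of this:

```python
import json

def score(outputs: list[str], gold: list[dict]) -> dict:
    """Valid-JSON rate plus trigger / type / combined accuracy.

    Assumes one gold event per sentence, keyed by "trigger" and
    "event_type" (illustrative names, not the datasets' actual schema).
    """
    n = len(outputs)
    valid = trig = both = 0
    for raw, ref in zip(outputs, gold):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output counts against valid-JSON rate
        valid += 1
        if not isinstance(pred, dict):
            continue  # parseable but wrong shape: valid JSON, zero credit
        trig_ok = pred.get("trigger", "").strip().lower() == ref["trigger"].lower()
        trig += trig_ok
        both += trig_ok and pred.get("event_type") == ref["event_type"]
    return {
        "valid_json_rate": valid / n,
        "trigger_accuracy": trig / n,
        "type_accuracy": both / max(trig, 1),  # among correctly found triggers
        "combined_accuracy": both / n,
    }
```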

Findings (and honest caveats)

Constrained-label prompting roughly doubled type accuracy while only marginally hurting trigger recall. Most failures came from ambiguous near-synonym triggers and from the model producing extra speculative events the source sentence didn't license.

This is a baseline, not state of the art — the point was to make the fine-tuning experiments that come next interpretable.

What I'd do differently

Add a self-consistency vote across multiple samples, and build an out-of-domain test slice to stress-test the constrained-label setup against real-world noisy text.
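A minimal sketch of what that self-consistency vote could look like: sample several completions at nonzero temperature and keep the (trigger, type) pair the majority agree on. The sample_fn wrapper and tie-breaking behaviour here are assumptions, not part of the project:

```python
import json
from collections import Counter

def self_consistent_extract(sentence, labels, sample_fn, k=5):
    """Majority vote over k sampled extractions.

    sample_fn is assumed to wrap extract() above with do_sample=True and a
    nonzero temperature (a hypothetical helper); ties break arbitrarily.
    """
    votes = Counter()
    for _ in range(k):
        try:
            pred = json.loads(sample_fn(sentence, labels))
        except json.JSONDecodeError:
            continue  # unparseable samples simply lose their vote
        if isinstance(pred, dict):
            votes[(pred.get("trigger"), pred.get("event_type"))] += 1
    return votes.most_common(1)[0][0] if votes else None
```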