In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then, in a second step, explain those labels using interpretable regression analyses. Recent advances in large language models (LLMs) can lower the cost of CSS research by annotating documents cheaply at scale, but such surrogate labels are often imperfect and biased. We present a new algorithm for using LLM outputs in downstream statistical analyses while guaranteeing statistical properties, such as asymptotic unbiasedness and proper uncertainty quantification, that are fundamental to CSS research. We show that directly using LLM-predicted surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even when surrogate accuracy is high (80-90%). By controlling the probability with which documents are sampled for gold-standard labeling, our approach guarantees valid inference for downstream statistical analyses, even when surrogates are arbitrarily biased, and without requiring stringent assumptions. This talk is based on joint work with Naoki Egami, Musashi Hinck, and Hanying Wei.
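The abstract does not spell out the estimator, but a minimal sketch of one standard design-based correction in this spirit (a Horvitz-Thompson-style adjustment using known gold-standard sampling probabilities) is shown below. The simulated data, the 85% surrogate accuracy, and the 10% sampling rate are illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated data (purely illustrative; not from the talk) ---
n = 5000
x = rng.normal(size=n)                                           # document-level covariate
y_gold = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)   # true (expert) label
# Imperfect LLM surrogate: correct ~85% of the time, with systematic error otherwise
y_llm = np.where(rng.random(n) < 0.85, y_gold, 1 - y_gold)

# --- Researcher-controlled gold-standard sampling ---
pi = np.full(n, 0.10)             # known inclusion probabilities, set by the researcher
sampled = rng.random(n) < pi      # which documents receive expert (gold-standard) labels

# Design-based bias correction: start from the surrogate for every document,
# and correct it with the gold label on the sampled documents, weighted by 1/pi.
correction = np.where(sampled, (y_gold - y_llm) / pi, 0.0)
y_tilde = y_llm + correction

# --- Downstream analysis: OLS of the outcome on covariates ---
X = np.column_stack([np.ones(n), x])
beta_naive = np.linalg.lstsq(X, y_llm, rcond=None)[0]       # surrogate labels only
beta_corrected = np.linalg.lstsq(X, y_tilde, rcond=None)[0]  # bias-corrected pseudo-outcome
beta_oracle = np.linalg.lstsq(X, y_gold, rcond=None)[0]      # infeasible benchmark

print("naive (surrogate only):", beta_naive)
print("bias-corrected:        ", beta_corrected)
print("oracle (all gold):     ", beta_oracle)
```

Because the sampling probabilities are known and controlled by the researcher, the correction term has mean zero regardless of how biased the surrogate is, which is the intuition behind the validity guarantee described above; the full method and its uncertainty quantification are developed in the talk.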