In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then, in a second step, explain those labels with interpretable regression analyses. Recent advances in large language models (LLMs) can lower costs for CSS research by annotating documents cheaply at scale, but such surrogate labels are often imperfect and biased. We present a new algorithm for using outputs from LLMs in downstream statistical analyses while guaranteeing statistical properties, such as asymptotic unbiasedness and proper uncertainty quantification, that are fundamental to CSS research. We show that directly plugging LLM-predicted surrogate labels into downstream statistical analyses leads to substantial bias and invalid confidence intervals, even when surrogate accuracy is high (80-90%). By controlling the probability with which documents are sampled for gold-standard labeling, our approach guarantees valid inference for downstream statistical analyses even when surrogates are arbitrarily biased, without requiring stringent assumptions. This talk is based on joint work with Naoki Egami, Musashi Hinck, and Hanying Wei.
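The core idea can be illustrated with a minimal sketch. This is not the authors' exact estimator; it is a simplified simulation, assuming a uniform 10% gold-labeling probability and an 85%-accurate surrogate, that shows how an inverse-probability-weighted correction on a researcher-controlled gold-labeled subsample removes the bias of surrogate-only regression (the proper uncertainty quantification in the paper goes beyond the plain robust standard errors used here).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated corpus: X are document-level covariates, y_true are gold-standard
# labels (in practice only observed for sampled documents), y_llm are
# imperfect LLM surrogate labels (~85% accurate, hence biased).
n = 5000
X = rng.normal(size=(n, 2))
y_true = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y_llm = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)

# Researcher-controlled sampling: each document is gold-labeled with a known
# probability pi; R marks the documents actually sent for gold labeling.
pi = np.full(n, 0.10)
R = (rng.random(n) < pi).astype(float)

# Bias-corrected pseudo-outcome: the surrogate label plus an inverse-
# probability-weighted correction on the gold-labeled subsample. Because pi
# is known by design, this has the same expectation as the gold label no
# matter how biased the surrogate is.
y_tilde = y_llm + (R / pi) * (y_true - y_llm)

# Downstream interpretable regression on the corrected outcome vs. the
# naive regression that uses surrogate labels directly.
corrected = sm.OLS(y_tilde, sm.add_constant(X)).fit(cov_type="HC2")
naive = sm.OLS(y_llm, sm.add_constant(X)).fit(cov_type="HC2")
print("corrected:", corrected.params.round(3))
print("naive (surrogates only):", naive.params.round(3))
```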