Speaker:
Valentin Hofmann is a final-year DPhil student at the University of Oxford and a research assistant at LMU Munich. His work broadly focuses on the intersection of natural language processing, linguistics, and computational social science, with specific interests in tokenization, socially and temporally aware language models, and graph-based methods. He has previously spent time as a research intern at DeepMind and as a visiting scholar at Stanford University.
Abstract:
Language models (LMs) like ChatGPT have achieved unprecedented levels of performance in natural language processing. One common characteristic of these models is that they segment text into a sequence of tokens from a fixed-size vocabulary, a step commonly referred to as tokenization.
In this talk, I will take a closer look at how the linguistic properties of tokenization affect the way LMs process complex words (e.g., “superbizarre”). I will first give an overview of different forms of complex word processing in humans and AI systems. I will then present recent computational studies showing that the tokenization used by LMs can produce linguistically invalid segmentations (e.g., “superb-iza-rre”) that severely affect how LMs interpret complex words. Finally, I will discuss potential solutions to this problem.
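To see how such invalid segmentations can arise, here is a minimal toy sketch of greedy longest-match subword tokenization (the general strategy behind WordPiece-style tokenizers; this is an illustration, not any specific model's tokenizer, and the small vocabulary below is a hypothetical one chosen for the example):

```python
def greedy_tokenize(word, vocab):
    """Segment a word by repeatedly taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical vocabulary: it contains both the morphemes "super" and
# "bizarre", yet greedy matching still picks the longer piece "superb".
VOCAB = {"super", "superb", "bizarre", "iza", "rre"}

print(greedy_tokenize("superbizarre", VOCAB))  # -> ['superb', 'iza', 'rre']
```

The morphologically valid split "super" + "bizarre" is available in this vocabulary, but the greedy strategy commits to "superb" and is then forced into the linguistically invalid pieces "iza" and "rre" — exactly the kind of segmentation the abstract's example describes.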