Speaker:
Valentin Hofmann is a final-year DPhil student at the University of Oxford and a research assistant at LMU Munich. His work broadly focuses on the intersection of natural language processing, linguistics, and computational social science, with specific interests in tokenization, socially and temporally aware language models, and graph-based methods. He has previously spent time as a research intern at DeepMind and as a visiting scholar at Stanford University.
Abstract:
Language models (LMs) like ChatGPT have achieved unprecedented levels of performance in natural language processing. One common characteristic of these models is that they segment text into a sequence of tokens from a fixed-size vocabulary, a step commonly referred to as tokenization.
In this talk, I will take a closer look at how the linguistic properties of tokenization affect the way LMs process complex words (e.g., “superbizarre”). I will first give an overview of different forms of complex word processing in humans and AI systems. I will then present recent computational studies showing that the tokenization used by LMs can produce linguistically invalid segmentations (e.g., “superb-iza-rre”) that severely affect how LMs interpret complex words. Finally, I will discuss potential solutions to this problem.
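To see how such invalid segmentations can arise, here is a minimal toy sketch of greedy longest-match subword tokenization (the general strategy behind WordPiece-style tokenizers; this is an illustration, not any specific model's tokenizer, and the small vocabulary below is a hypothetical one chosen for the example):

```python
def greedy_tokenize(word, vocab):
    """Segment a word by repeatedly taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical vocabulary: it contains both the morphemes "super" and
# "bizarre", yet greedy matching still picks the longer piece "superb".
VOCAB = {"super", "superb", "bizarre", "iza", "rre"}

print(greedy_tokenize("superbizarre", VOCAB))  # -> ['superb', 'iza', 'rre']
```

The morphologically valid split "super" + "bizarre" is available in this vocabulary, but the greedy strategy commits to "superb" and is then forced into the linguistically invalid pieces "iza" and "rre" — exactly the kind of segmentation the abstract's example describes.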