This presentation discusses the transformation of assessment development through AI, computational psychometric models, and engineering techniques. The cycle of test construction, administration, and scoring is labor-intensive, time-consuming, and costly, yet it is necessary to support high standards of validity evidence and, in some cases, legislative requirements. We will hear how AI generally, and large language models in particular, have entered various phases of the test development cycle, with a focus on item generation and test design across skills and domains.
These generative AI-based language models consist of multi-layered neural networks. Generative Pre-trained Transformer 3 (GPT-3) is a well-known example: a machine learning model pre-trained on a dataset of roughly 500 billion tokens that can generate stories, blog posts, news reports, and dialogue that are often indistinguishable from human writing. The advantage of pre-trained language models such as GPT-3 is that, once they have been built on a vast corpus of data and have learned the connections between words, they can be tuned for various purposes with relatively little data. This can be useful, for instance, in automated item generation (AIG) and text generation. Unlike the older rule-based template approach to AIG, in which elements of a fixed item model are swapped out, pre-trained language models can create original content (texts and items). Using computational psychometrics (a blend of AI and psychometrics; von Davier, 2015, 2017), we can then produce estimates of item difficulties with very little pilot data (Attali et al., 2022).
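As a point of contrast, the older rule-based template approach mentioned above can be sketched in a few lines: a fixed stem with swappable slots yields a family of item clones, each with an answer key derived from the slot values. The template, names, and value pools below are invented purely for illustration.

```python
import itertools

# A rule-based AIG template: a fixed stem with swappable slots.
# Template and element pools are hypothetical, for illustration only.
TEMPLATE = "If {name} buys {n} apples at {price} cents each, how much does {name} pay?"

NAMES = ["Ana", "Ben"]
QUANTITIES = [3, 4]
PRICES = [10, 25]

def generate_items():
    """Cross every slot value to produce item clones with answer keys."""
    items = []
    for name, n, price in itertools.product(NAMES, QUANTITIES, PRICES):
        items.append({
            "stem": TEMPLATE.format(name=name, n=n, price=price),
            "key": n * price,  # key in cents, computed from the swapped-in elements
        })
    return items

items = generate_items()
print(len(items))  # 2 names x 2 quantities x 2 prices = 8 item clones
```

Every item produced this way shares the same surface structure, which is precisely the limitation generative models address: they can vary wording, context, and content, not just the slot values.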
These applications will be illustrated with the “Item Factory”, a newly launched intelligent system that uses human-in-the-loop AI for test development for Duolingo’s assessments.
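To make the difficulty-estimation step above concrete: Attali et al. (2022) use richer computational psychometric models, but a minimal illustration of estimating an item's difficulty from a small pilot sample is a one-parameter (Rasch) fit for a single item, treating the pilot takers' ability estimates as known. The function below is a sketch under those simplifying assumptions, not the method used in the cited work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rasch_item_difficulty(responses, abilities, iters=50, tol=1e-9):
    """MLE of one item's Rasch difficulty b from scored pilot responses (1/0),
    with each taker's ability theta treated as known:
        P(correct | theta, b) = sigmoid(theta - b)
    Solved by Newton-Raphson on the log-likelihood."""
    b = 0.0
    for _ in range(iters):
        ps = [sigmoid(theta - b) for theta in abilities]
        grad = sum(p - x for p, x in zip(ps, responses))  # -dloglik/db
        hess = sum(p * (1.0 - p) for p in ps)             # observed information
        delta = grad / hess
        b += delta                                        # Newton-Raphson update
        if abs(delta) < tol:
            break
    return b

# Tiny hypothetical pilot: four takers of average ability, three answer correctly.
b = rasch_item_difficulty([1, 1, 1, 0], [0.0, 0.0, 0.0, 0.0])
# b = -ln(3) ≈ -1.0986: mostly-correct responses imply a relatively easy item.
```

With abundant pilot data per item this is routine; the point of the computational psychometric approach is that model-based predictions (e.g., from item features) can substitute for most of the pilot responses this fit would otherwise require.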