Delegation with Imperfect Recall: An Approach to AI Control
We study a scenario in which (i) a principal seeks to assess an agent’s alignment before deciding whether to delegate a task, but (ii) standard tests are compromised by the agent’s incentives to feign alignment. By obscuring whether tasks are real or part of a test, the principal induces uncertainty about the payoff-relevance of the agent’s choice. We model this instrument using games of imperfect recall and study a specific application to deceptive alignment, a potential challenge in the deployment of future situationally aware AI systems. In our baseline model, the principal runs a testing episode before deciding whether to deploy the agent, and the agent cannot distinguish between testing and deployment. Leveraging this uncertainty, the principal achieves two goals: screening out misaligned agents and disciplining their behaviour. In equilibrium, adding multiple testing episodes reinforces both effects and can allow the principal to achieve her full-information payoff in finite time. We show that profitable deployment can be consistently sustained in equilibrium only with imperfect recall. With commitment, however, the principal can achieve strictly more than her full-information payoff in finite time and asymptotically achieve maximal disciplining, with or without recall. Our results are robust to the agent observing arbitrarily precise signals about its location.
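To convey the flavour of the disciplining effect, here is a back-of-the-envelope calculation (our gloss on the abstract; the notation $T$, $g$, $c$, $q$ is illustrative and not taken from the paper). Suppose the principal runs $T$ testing episodes followed by one deployment episode, a misaligned agent earns $g$ from misbehaving in deployment and $c > 0$ from behaving as intended there, and misbehaving during a test is detected and forecloses deployment. In this stylized version, imperfect recall forces the agent to misbehave with the same probability $q$ in every episode, so its expected payoff is

\[
V(q) \;=\; (1-q)^{T}\,\bigl[\,q\,g + (1-q)\,c\,\bigr],
\qquad
V'(0) \;=\; g - (T+1)\,c .
\]

Since $V$ is single-peaked in $q$, full compliance ($q = 0$) is optimal exactly when $V'(0) \le 0$, i.e. when $T \ge g/c - 1$: a sufficiently long testing phase makes compliance optimal even for a misaligned agent, consistent with the abstract's claim that additional testing episodes reinforce both screening and disciplining.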
Date: 6 December 2024, 12:45 (Friday, 8th week, Michaelmas 2024)
Venue: Manor Road Building, Manor Road OX1 3UQ
Venue Details: Seminar Room G
Speaker: Eric Chen (GPI Oxford)
Organising department: Department of Economics
Part of: Student Research Workshop in Micro Theory
Booking required?: Not required
Audience: Members of the University only
Editor: Edward Clark