Big Data Ethics Forum: Fake news? Epistemic and ethical challenges for synthetic datasets

Synthetic data are data generated by an algorithm to have properties similar to those of real data. They may be useful whenever real data are too sensitive, too valuable, or too limited to meet research needs. Synthetic data may be used, for example, to teach population health science without releasing real patient data to students, or to develop statistical analyses while avoiding Gelman and Loken’s “garden of forking paths”. But the more closely synthetic data replicate the properties and patterns of real data, i.e. the more realistic they are, the greater the risk that they fail to achieve some of these objectives. Information contained in synthetic data could be used to learn about the real data on which they are based, potentially risking participant privacy or exhausting the utility of the real data for hypothesis testing. Conversely, attempts to reduce bias in certain machine learning models by augmenting real data with synthetic data may be defeated by a lack of realism. The ethical and epistemic problems that might motivate the use of synthetic data, and the issues that may affect how they are generated and used, will be presented for discussion.