A data feminist examination of the data that exists and the data that is missing or has never been collected, taking in the radical counterdata work of W.E.B. Du Bois at the Exposition Universelle of 1900.
The meaning of a dataset
In 2016, artist and researcher Mimi Ọnụọha created the first iteration of The Library of Missing Datasets. The art installation is a filing cabinet whose drawers are stuffed full of individually labelled files; each file, however, is empty.
Examples include:
“Undocumented immigrants currently incarcerated and/or underpaid”
“Measurements for global web users that take into account shared devices and VPNs”
“LGBT older adults discriminated against in housing”
“All extinct languages”
Datasets are representations of things that exist in the world. They range in scale from the personal (e.g. a spreadsheet populated and managed by one person) to large-scale datasets such as LAION-5B (a dataset of 5.85 billion image-text pairs scraped from the internet and used to train Stable Diffusion, a popular AI image generation tool) and Common Crawl (a 9.5-plus petabyte dataset of web crawl data dating back to 2008, used in the pre-training of GPT-3).
“Wherever large amounts of data are collected, there are often empty spaces where no data live. The word “missing” is inherently normative. It implies both a lack and an ought: something does not exist, but it should... It’s in these things that we find cultural and colloquial hints of what is deemed important. Spots that we've left blank reveal our hidden social biases and indifferences.” — Mimi Onuoha on The Library of Missing Datasets
Our datasets are not as impartial or complete as we may first believe. We increasingly understand how skewed or missing data produces biased and discriminatory outcomes, in systems ranging from facial recognition and automated decision-making to algorithmically assisted smart systems.
We tend to treat the data we work with as simply being there, but how often do we ask how or why a dataset was constructed? Whose viewpoints have influenced what is included or excluded? Why has society withheld the resources, or lacked the motivation, to fill the gaps?
The Library of Missing Datasets reveals the flaws in how we approach data work, not only by drawing attention to the gaps in existing records but also by identifying data that has never existed and asking whom that missing data might represent.
Technological impact of data gaps
The use of data in large-scale, automated systems should be understood not only from a data science perspective but also through a data feminist approach. Data feminism examines the power relations and methods involved in gathering data, critically analyses how data is defined, and seeks to make all labour visible.1