Hyper-Datafication

In our paper "How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI" (FAccT 2026), we define hyper-datafication as:

Hyper-datafication is the industrialised production and accumulation of data for AI model development across three coupled processes: (i) the large-scale collection and recombination of existing data sources, (ii) the use of AI systems to generate synthetic data, and (iii) the creation of purpose-built data whose primary function is to serve as training input for AI systems rather than direct human use.

This shift, from building models from data to actively creating data for models, is not an abstract trend. It is happening right now, with real environmental, economic, and social costs that often go unexamined. Occasionally, some of these stories make it into the news. This page is my attempt to keep track of them as they emerge. If you come across such stories, feel free to send them my way.

🫆🖥️ Meta logs employee keystrokes for AI training

Reuters — April 2026

Meta is installing tracking software on US employees' computers to capture mouse movements, clicks, keystrokes, and occasional screenshots, feeding the data directly into its AI agent training pipeline. A Meta spokesperson told the BBC that the sole purpose of the data collection is to train agents to perform computer-based tasks. In this way, employees' everyday work has become AI training data.

🦿🌏 At-home gig recording in more than 50 countries

MIT Technology Review — April 2026

Gig workers in more than 50 countries are recording everyday tasks at home to produce data. Although the pay is good relative to local standards, the work raises privacy questions: how do you keep your private life off camera?

🤖🇨🇳 State-sponsored robot training centres

Rest of World — January 2026

In China, local governments are building robot training centres and hiring workers to perform mundane tasks to address a shortage of data for humanoid robots. More than 40 state-owned training centres have been announced. This is hyper-datafication operating at state level: purpose-built national infrastructure for training data production.

🦾🇮🇳 Hand farms for robot training

Quasa — November 2025

In Bengaluru, workers wearing head-mounted cameras spend hours folding towels, sorting utensils, and stacking boxes, following scripts with strict time limits and earning around USD 230–250 per month. The footage is uploaded to AI labs in the US, where AI models use it to train robots to grasp, fold, and manipulate objects: tasks that these robots may eventually perform instead of humans. The workers carrying out these tasks are mostly in the Global South, while the economic and practical benefits of the resulting AI systems accrue predominantly in the Global North. This is a clear example of hyper-datafication, and of how unevenly its costs and benefits are distributed.