Hyper-Datafication

In our paper "How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI" (FAccT 2026), we define hyper-datafication as:

Hyper-datafication is the industrialised production and accumulation of data for AI model development across three coupled processes: (i) the large-scale collection and recombination of existing data sources, (ii) the use of AI systems to generate synthetic data, and (iii) the creation of purpose-built data whose primary function is to serve as training input for AI systems rather than direct human use.

This shift - from building models from data to actively creating data for models - is not an abstract trend. It is happening right now, with real environmental, economic, and social costs that often go unexamined. Occasionally, some of these stories make it into the news. This page is my attempt to keep track of them as they emerge. If you come across stories, feel free to send them my way.

🤳🇮🇳 No contracts and a two-thirds pay cut in less than two weeks for Bengaluru market vendors collecting AI training data

The News Minute - June 2026

At the New Thippasandra market in Bengaluru, vendors are hired by companies such as Instawork to record egocentric video data for AI training. Workers report receiving no written contracts and only a verbal explanation of the arrangement. In less than two weeks, pay reportedly dropped from 300 rupees (around USD 3) to 100 rupees (around USD 1) per hour without warning or explanation. Vendors are informed that the data may be used to train AI systems capable of performing tasks similar to their own. Yet for many, immediate financial necessity leaves little room to refuse work, even when they believe they may be helping to create systems that could eventually reduce demand for their labour.

🧹🇺🇸 "Free" housecleaning in exchange for letting cleaners record your home

Forbes - May 2026

The German start-up Shift is offering free housecleaning in New York City in exchange for allowing cleaners to record the session from a first-person perspective. The footage will then be used to train humanoid robots. It is a telling sign of just how valuable robot training data has become.

🫆🇺🇸 Meta logs employee keystrokes for AI training

Reuters - April 2026

Meta is installing tracking software on US employees' computers to capture mouse movements, clicks, keystrokes, and occasional screenshots, feeding the data directly into its AI agent training pipeline. A Meta spokesperson told the BBC that the sole purpose of the data collection is to train agents to perform computer-based tasks. In this way, employees' everyday work has become AI training data.

🦿🌍 At-home gig recording in more than 50 countries

MIT Technology Review - April 2026

Gig workers in more than 50 countries record data on everyday tasks at home. Although the pay is good relative to local standards, the work raises privacy questions: how do you keep your private life off camera?

🤖🇨🇳 State-sponsored robot training centres

Rest of World - January 2026

In China, local governments are building robot training centres and hiring workers to perform mundane tasks to address a shortage of data for humanoid robots. More than 40 state-owned training centres have been announced. This is hyper-datafication operating at state level: purpose-built national infrastructure for training data production.

🦾🇮🇳 Hand farms for robot training

Quasa - November 2025

In Bengaluru, workers wearing head-mounted cameras spend hours folding towels, sorting utensils, and stacking boxes, following scripts with strict time limits and earning around USD 230–250 per month. The footage is uploaded to AI labs in the US, where AI models use it to train robots to grasp, fold, and manipulate objects, tasks these robots may eventually perform instead of humans. The workers carrying out these tasks are mostly in the Global South, while the economic and practical benefits of the resulting AI systems predominantly accrue in the Global North. This is a clear example of hyper-datafication, and of how its costs and benefits are distributed unevenly.