Problems in Data Labelling

When people discuss the "black boxes" of AI, the focus is usually technical: the unease that comes from being unable to discern why a machine makes certain decisions. However, there is another opaque aspect: data labelling. This grey area, a metaphorical black box of its own, involves potential ethical issues, including the overworking and underpayment of workers.

As Forbes points out, the software industry has a "dark side": most AI applications need labels, known as "ground truth," and this need has given rise to a data labelling industry. This sector could be considered the blue-collar work of the future, and it carries significant ethical considerations.

Datasets are collected from various online and external sources and are categorised based on their nature, type of data, and distinct features. Data labelling is the process of identifying raw data (like images, text files, or videos) and adding meaningful tags or labels to provide context so that a machine learning model can learn from it.

This process is crucial in AI and ML, as we have previously mentioned, because labelled data serves as the training set that teaches algorithms to recognise patterns and make decisions. Without labelled data, a machine learning model would struggle to understand the input it is given or to make accurate predictions. It is like trying to learn a new language without a dictionary: without knowing what each word means, it is far more difficult, though not impossible, to grasp the language's structure or communicate effectively.
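To make the role of labels concrete, here is a minimal sketch in Python. The dataset, the label names, and the `train` and `predict` helpers are all hypothetical and invented for illustration; real systems use far larger datasets and proper machine learning libraries. The sketch only shows the basic idea: raw text items are paired with human-assigned labels, and a very simple model uses those pairs to guess the label of new input.

```python
from collections import Counter, defaultdict

# Hypothetical labelled dataset: each raw text item is paired with
# a label that a human annotator assigned to it.
labelled_data = [
    ("the cat sat on the mat", "animal"),
    ("dogs bark at the cat", "animal"),
    ("the car needs new tyres", "vehicle"),
    ("a truck and a car on the road", "vehicle"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    return word_counts

def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

model = train(labelled_data)
print(predict(model, "cat"))    # word seen only under "animal"
print(predict(model, "truck"))  # word seen only under "vehicle"
```

Without the second element of each pair, the `train` step would have nothing to associate the words with, which is the point the dictionary analogy makes.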

Current approaches to data labelling face several critical issues, most of which stem from how highly centralised the process is. Centralised control often leads to a lack of transparency in how data is collected, labelled, and used, raising concerns about data privacy and ethical handling.

One of the most significant problems is the presence of considerable friction and barriers in the process, making it slow and inefficient. According to the AI Index report, one of the top barriers to scaling existing AI initiatives is the challenge of obtaining more data or inputs to train a model, cited by 44% of leaders. This difficulty in acquiring adequate data slows down the development of AI models and hampers innovation.

Furthermore, the issue of very low wages for data labellers is a significant concern. Data labelling is often outsourced to workers in low-income countries, who are paid minimal wages for repetitive and time-consuming tasks. This not only raises ethical concerns about fair labour practices but also affects the quality of data labelling. Low wages can lead to low morale and less incentive for labellers to ensure high accuracy, directly impacting the quality of AI models that rely on this labelled data.

It’s important to know these things and to understand the implications of improper and unethical practices, even if the current arrangement is convenient for some. As the Wired article quotes: “From the clients’ perspective, the invisibility of the workers in micro-tasking is not a bug but a feature”.

Let’s not forget that data labelling is already a very busy market with many big players, such as Amazon Mechanical Turk, Appen, Labelbox, Scale AI, Hive Micro, Mighty AI, Remotasks and many more.

Over the years, a growing number of reports have described unfair compensation, inadequate working conditions and mistreatment of data labellers around the world, giving rise to the newly coined term "data colonialism". For reference, read this comprehensive article by MIT Technology Review, which uses the example of crisis-stricken Venezuela as a cheap labour market and offers an in-depth perspective on big companies’ practices.

