# Problems in Data Labelling

When discussing the "black boxes" of AI, the focus is often on technical opacity: the inability to discern why a machine makes a certain decision, which leads to unease. However, there is another opaque aspect: data labelling. This grey area, a metaphorical black box of its own, involves potential ethical issues, including the overworking and underpayment of workers.

As Forbes points out, most AI applications depend on labels, known as "ground truth," and this need has given rise to a data labelling industry that exposes a "dark side" of the software industry. This sector could be considered the [blue-collar work of the future](https://www.bbc.com/news/technology-46055595), and it carries significant ethical considerations.

Datasets are collected from various online and external sources and are categorised based on their nature, type of data, and distinct features. Data labelling is the process of identifying raw data (like images, text files, or videos) and adding meaningful tags or labels to provide context so that a machine learning model can learn from it.
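As a minimal sketch of what "adding a label to raw data" can look like in practice, the snippet below pairs each raw item with a tag. The `LabelledExample` class, the `attach_labels` helper, the file names and the tags are all hypothetical and chosen only for illustration; real labelling tools use richer formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledExample:
    """One raw data item paired with the tag a human labeller assigned."""
    raw: str    # the raw item, e.g. a sentence or a path to an image
    label: str  # the meaningful tag that gives the item context


def attach_labels(raw_items, labels):
    """Pair each raw item with its label; the two lists must line up."""
    if len(raw_items) != len(labels):
        raise ValueError("every raw item needs exactly one label")
    return [LabelledExample(raw, label) for raw, label in zip(raw_items, labels)]


# Hypothetical raw data and the tags a labeller might assign.
raw_items = ["cat_001.jpg", "dog_004.jpg", "cat_017.jpg"]
labels = ["cat", "dog", "cat"]

dataset = attach_labels(raw_items, labels)
```

Each entry in `dataset` now carries both the raw item and its context, which is the form a model can learn from.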

This process is crucial in AI and ML, as we have [previously mentioned](/the-solution/machine-learning-in-a-nutshell.md), because labelled data serves as the training set that teaches algorithms to recognise patterns and make decisions. Without labelled data, a machine learning model would struggle to understand the input it is given or to make accurate predictions. It's like trying to learn a new language without a dictionary: without knowing what each word means, it is far more difficult, though not impossible, to grasp the language's structure or communicate effectively.
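To make the role of the training set concrete, here is a deliberately simple sketch of a model that learns from labelled text by counting words per label. It is a toy illustration only, not how production models are trained; the `train` and `predict` functions and the example sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(word_counts, text):
    """Pick the label whose training words overlap most with the input."""
    words = text.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


# Hypothetical labelled training set: (raw text, label) pairs.
labelled = [
    ("great product works well", "positive"),
    ("really great service", "positive"),
    ("terrible quality broke fast", "negative"),
    ("awful terrible support", "negative"),
]

model = train(labelled)
prediction = predict(model, "great service")
```

Here the labels are the only source of meaning: the same sentences without their tags would give this toy model nothing to count against, so it could not prefer one answer over another.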

The current approaches to data labelling face several critical issues, primarily stemming from their highly centralised and non-transparent nature. This centralised control often leads to a lack of transparency in how data is collected, labelled, and used, raising concerns about data privacy and ethical handling.

One of the most significant problems is the presence of considerable friction and barriers in the process, making it slow and inefficient. According to the AI Index report, one of the top barriers to scaling existing AI initiatives is the challenge of obtaining more data or inputs to train a model, cited by 44% of leaders. This difficulty in acquiring adequate data slows down the development of AI models and hampers innovation.

Furthermore, the issue of very low wages for data labellers is a significant concern. Data labelling is often outsourced to workers in low-income countries, who [are paid minimal wages](https://www.wired.com/story/millions-of-workers-are-training-ai-models-for-pennies/) for repetitive and time-consuming tasks. This not only raises ethical concerns about fair labour practices but also affects the quality of data labelling. Low wages can lead to low morale and less incentive for labellers to ensure high accuracy, directly impacting the quality of AI models that rely on this labelled data.

It’s important to be aware of these issues and to understand the implications of improper and unethical practices. For clients, however, this opacity may be convenient. As the [Wired article](https://www.wired.com/story/millions-of-workers-are-training-ai-models-for-pennies/) puts it: “From the clients’ perspective, the invisibility of the workers in micro-tasking is not a bug but a feature.”

Let’s not forget that data labelling is already a very busy market with many big players, such as Amazon Mechanical Turk, Appen, Labelbox, Scale AI, Hive Micro, Mighty AI, Remotasks and many more.

Over the years, a growing number of reports have [described](https://aijourn.com/ais-race-to-the-bottom-why-we-can-no-longer-ignore-the-exploitative-practices-in-data-labeling/) unfair compensation, inadequate working conditions and mistreatment of data labellers around the world, giving rise to the newly coined term "data colonialism". For reference, read this comprehensive [article](https://www.technologyreview.com/2022/04/20/1050392/ai-industry-appen-scale-data-labels/) by MIT Technology Review, which uses crisis-stricken Venezuela as an example of a cheap labour market and offers an in-depth perspective on big companies’ practices.

