Timeworx.io: Whitepaper
Problems in Data Labelling


Last updated 1 year ago

When discussing the "black boxes" of AI, the focus is often on the technical aspects, such as the inability to discern why a machine makes certain decisions, leading to unease. However, there is another opaque aspect: data labelling. This grey area, a metaphorical black box in its own right, involves potential ethical issues, including the overworking and underpayment of workers.

As Forbes points out, the software industry's "dark side" is highlighted by the necessity of labels, known as "ground truth", for most AI applications, leading to the rise of a data labelling industry. This sector could be considered the blue-collar work of the future, encompassing significant ethical considerations.

Datasets are collected from various online and external sources and are categorised based on their nature, type of data, and distinct features. Data labelling is the process of identifying raw data (like images, text files, or videos) and adding meaningful tags or labels to provide context so that a machine learning model can learn from it.

This process is crucial in AI and ML, as we have previously mentioned, because labelled data serves as the training set that teaches the algorithms to recognise patterns and make decisions. Without labelled data, a machine learning model would struggle to understand the input it is given or to make accurate predictions. It is like trying to learn a new language without a dictionary: without knowing what each word means, it is far more difficult, though not impossible, to grasp the language's structure or communicate effectively.
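The role of labels as "ground truth" can be made concrete with a toy sketch (the data and function below are entirely hypothetical, for illustration only): a simple nearest-neighbour classifier can assign a category to a new input only because each training example carries a human-provided label.

```python
# Minimal illustration of why labelled data matters: the model's only source
# of meaning is the label attached to each training example by a human.

def nearest_neighbour(labelled_data, point):
    """Predict a label by copying the label of the closest training example."""
    closest = min(
        labelled_data,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )
    return closest[1]

# Each example is (features, label); the labels are the data labellers' output.
training_set = [
    ((0.9, 0.1), "cat"),
    ((0.8, 0.2), "cat"),
    ((0.1, 0.9), "dog"),
    ((0.2, 0.8), "dog"),
]

print(nearest_neighbour(training_set, (0.85, 0.15)))  # prints "cat"
```

Strip the labels from `training_set` and the same algorithm can still find the closest example, but it has no way to say what that example *is*, which is the dictionary-less situation described above.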

The current approaches to data labelling face several critical issues, primarily stemming from their highly centralised and non-transparent nature. This centralised control often leads to a lack of transparency in how data is collected, labelled, and used, raising concerns about data privacy and ethical handling.

One of the most significant problems is the presence of considerable friction and barriers in the process, making it slow and inefficient. According to the AI Index report, one of the top barriers to scaling existing AI initiatives is the challenge of obtaining more data or inputs to train a model, cited by 44% of leaders. This difficulty in acquiring adequate data slows down the development of AI models and hampers innovation.

Furthermore, the issue of very low wages for data labellers is a significant concern. Data labelling is often outsourced to workers in low-income countries, who are paid minimal wages for repetitive and time-consuming tasks. This not only raises ethical concerns about fair labour practices but also affects the quality of data labelling. Low wages can lead to low morale and less incentive for labellers to ensure high accuracy, directly impacting the quality of AI models that rely on this labelled data.

It is important to be aware of these practices and to understand the implications of improper and unethical conduct. For some, the opacity might even be convenient; as a Wired article puts it: "From the clients' perspective, the invisibility of the workers in micro-tasking is not a bug but a feature".

Let's not forget that data labelling is already a crowded market with many big players, such as Amazon's Mechanical Turk, Appen, Labelbox, Scale AI, Hive Micro, Mighty AI, Remotasks and many more.

Throughout the years, an increasing number of reports have been filed mentioning unfair compensation, inadequate working conditions and mistreatment of data labellers around the world, leading to the newly coined term "data colonialism". For reference, read this comprehensive article by MIT Technology Review, which uses the example of crisis-stricken Venezuela as a cheap labour market and offers an in-depth perspective on big companies' practices.
