Data Processing in a Nutshell
In its raw form, data is not very useful to any individual or organisation in their attempts to make more informed decisions. By collecting the raw data, translating it into a usable form and manipulating the data, data scientists or data engineers are able to extract meaningful information from it, through a process called data processing.
Therefore, data processing is a combination of human and machine intelligence, through which a set of data inputs is transformed into a set of data outputs given an appropriate and relevant context. We can consider data inputs and outputs as data, facts, or any type of information that can be interpreted. By producing meaningful information and presenting it in a human-readable form, such as graphs, charts, or statistics, members across organisations are able to understand and use the data to make more informed decisions.
The transformation from data inputs to data outputs is not a one-shot operation, rather it is a cycle in which processes are improved and more valuable insights are obtained. This data processing cycle is composed of a series of steps, executed in a specific order, and repeated and refined until the outcome is achieved:
Collection: raw data is accumulated and/or acquired from various sources, which might be analogue (e.g., written documents) or electronic (e.g., data centres, data lakes, and data warehouses). Since this is the first step in the cycle, it is important that the data is obtained from a trustworthy source; otherwise the GIGO principle applies: “garbage in, garbage out”.
Preprocessing: after the raw data is collected, it enters a phase of preparation and clean-up, and it is organised to be passed on to the next steps. At this stage, the raw data is diligently checked for errors, incomplete or missing items, duplicates, and the data is prepared in a format that is more suitable for the later steps.
Ingestion: the cleaned data is entered into the information system that will be carrying out the processing and made available for the actual processing step. This is the first step in which data is transformed into a form that is able to produce meaningful information.
Processing: the input data is interpreted and manipulated towards obtaining the desired outcome: more meaningful information. Data can be processed either manually or electronically, aided by software, numeric algorithms, artificial intelligence, and machine learning, depending on the intended outcome.
Interpretation: the processed data is analysed by a team of data scientists or data engineers, and is prepared to be delivered to non-data scientists. At this step, the data is translated into its final output format, such as graphs, charts, statistics, or videos, and can be used across the organisation in more informed decision processes.
Storage: lastly, the processed data is preserved in a storage facility to be used either immediately in data analytics processes, or to be fed as an input into a higher-level data processing cycle.
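The steps above can be sketched as a small pipeline in plain Python. This is a minimal illustration, not a real system: the record fields ("region", "amount") and the hard-coded source data are hypothetical assumptions chosen to show collection, preprocessing (dropping incomplete records and duplicates), processing (aggregation), and interpretation (a human-readable summary).

```python
# A minimal sketch of the data processing cycle, with hypothetical data.

def collect():
    # Collection: raw records acquired from a source (hard-coded here).
    return [
        {"region": "EU", "amount": "120"},
        {"region": "EU", "amount": "120"},   # duplicate record
        {"region": "US", "amount": None},    # missing value
        {"region": "US", "amount": "80"},
    ]

def preprocess(raw):
    # Preprocessing: drop incomplete records, remove duplicates,
    # and convert fields into a format suitable for later steps.
    seen, clean = set(), []
    for rec in raw:
        if rec["amount"] is None:
            continue
        key = (rec["region"], rec["amount"])
        if key in seen:
            continue
        seen.add(key)
        clean.append({"region": rec["region"], "amount": float(rec["amount"])})
    return clean

def process(records):
    # Processing: aggregate amounts per region.
    totals = {}
    for rec in records:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

def interpret(totals):
    # Interpretation: render a human-readable summary.
    return [f"{region}: {total:.2f}" for region, total in sorted(totals.items())]

output = interpret(process(preprocess(collect())))
print(output)  # ['EU: 120.00', 'US: 80.00']
```

In a real system each stage would be a separate service or job, with the storage step feeding the results back in as the input to the next cycle.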
As can be seen, the end goal is not the data processing itself; rather, we process data to obtain even more meaningful data, better data, smarter data. It is a continuous process through which more data is generated by all organisations towards solving bigger and more complex problems. In other words, data never sleeps.
There are three main types of data processing:
Manual: a complete human-in-the-loop process in which the entire data processing is executed with human intervention and little to no technological support. This type of data processing is still very much needed, since there is a vast number of tasks at which people are currently much better than computers. The downside is that this type of data processing is highly susceptible to errors, and is very expensive.
Mechanical (Automatic): in an effort to reduce human errors and labour, mechanical automations can be put in place for processing data with higher speed and accuracy. However, the lifetime of such machines and devices can prove to be quite short, as they become obsolete and need to be replaced with units that are able to do more advanced processing.
Electronic (Computerised): data processing is carried out by specialised software, artificial intelligence, or machine learning, depending on the desired outcome. This type of data processing is the fastest, the most reliable, and the most accurate, though it is also the most expensive to achieve.
There is always a compromise between speed, accuracy and cost, but the goal is to build data processing systems that are able to scale and generate smarter and better data.
Relying on people for processing data is an outdated, and sometimes impractical, method, but it is still required until we can teach machines to do everything that we humans are capable of, and even more. Until we reach Artificial General Intelligence (AGI), we still require human intelligence to lead the way, thus creating an entanglement, or a virtuous cycle, between data processing and artificial intelligence:
On the one hand, data processing can only advance by transferring human intelligence into artificial intelligence through machine learning. On the other hand, machine learning requires enormous efforts in data processing to be able to drive innovation. With every machine learning model we create to automate a data processing workflow, we relieve humankind of this issue and people can start to focus on the next big problem that needs solving.
This entanglement between data processing and artificial intelligence has paved the way for the rise of automated data processing. Valued at $1.7 billion in 2024, the automated data processing market is expected to grow to $5 billion by 2030, with a robust Compound Annual Growth Rate (CAGR) of 20% over this period. This growth is fueled by the vast quantities of data being produced year over year, the rising demand for advanced analytics and business intelligence (BI), and, foremost, the advancement in AI and ML, which offer long-term efficiency gains.
In line with these developments, the field of DevOps (Development and Operations) is gradually shifting towards DataOps (Data Operations), a novel concept that introduces a holistic approach to managing the data value chain by combining Agile methodologies with automation, automated data processing, and data sharing initiatives. With a market valued at $3.9 billion in 2023, DataOps is expected to grow to $10.9 billion by 2028, with an astounding CAGR of 23%.
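As a quick sanity check on the growth figures quoted above, the stated rates follow from the standard CAGR formula, CAGR = (end value / start value)^(1 / years) - 1:

```python
# Verifying the quoted market growth rates with the standard CAGR formula.

def cagr(start_value, end_value, years):
    # Compound Annual Growth Rate over the given number of years.
    return (end_value / start_value) ** (1 / years) - 1

# Automated data processing market: $1.7B (2024) -> $5B (2030)
adp = cagr(1.7, 5.0, 2030 - 2024)
# DataOps market: $3.9B (2023) -> $10.9B (2028)
dataops = cagr(3.9, 10.9, 2028 - 2023)

print(f"{adp:.1%}")      # 19.7%, i.e. roughly the quoted 20%
print(f"{dataops:.1%}")  # 22.8%, i.e. roughly the quoted 23%
```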