AI Key Terms: Data and Processing

By Jill Hubbard Bowman

As one AI engineer explained to me, data is water to AI. It’s essential. AI is all about types of data—from raw training input to algorithms to the model and the final output. And like water, input data can be contaminated and need cleaning before it’s ingested. Engineers use “pipelines” for data processing and model creation. Data may be put into containers. Data can leak and bad data can make your AI system sick.

Below are some definitions of types of data, including some definitions from the EU AI Act, as well as descriptions of data processing and training.

TYPES OF DATA

Data

A narrow meaning of 'data' is facts from measurements, without context. A broader meaning of 'data' is information. Data may be factual or creative. Data may be in the form of numbers, words, texts, images, photographs, video, or sound recordings.

Database

A database is an aggregation of data that is selected and arranged in a systematic manner. An example of a multimodal database for autonomous vehicles is a collection of street-scene recordings from vision and range sensors such as cameras, lidar, and radar.

Data—Biometric

The EU AI Act defines ‘biometric data’ as “personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, such as facial images or dactyloscopic data.”

The Illinois Biometric Information Privacy Act defines “biometric information” as “any information, regardless of how it is captured, converted, stored, or shared, based on an individual's biometric identifier used to identify an individual. Biometric information does not include information derived from items or procedures excluded under the definition of biometric identifiers.”

The same Act defines “biometric identifier” as “a retina or iris scan, fingerprint, voiceprint, or scan of hand or face geometry. Biometric identifiers do not include writing samples, written signatures, photographs, human biological samples used for valid scientific testing or screening, demographic data, tattoo descriptions, or physical descriptions such as height, weight, hair color, or eye color. Biometric identifiers do not include donated organs, tissues, or parts as defined in the Illinois Anatomical Gift Act or blood or serum stored on behalf of recipients or potential recipients of living or cadaveric transplants and obtained or stored by a federally designated organ procurement agency. Biometric identifiers do not include biological materials regulated under the Genetic Information Privacy Act. Biometric identifiers do not include information captured from a patient in a health care setting or information collected, used, or stored for health care treatment, payment, or operations under the federal Health Insurance Portability and Accountability Act of 1996. Biometric identifiers do not include an X-ray, roentgen process, computed tomography, MRI, PET scan, mammography, or other image or film of the human anatomy used to diagnose, prognose, or treat an illness or other medical condition or to further validate scientific testing or screening.”

Data Label

A data label is a text annotation of a piece of data. Data labels may be aggregated in a computer file, which may be separate from the associated data set’s files.
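As a rough illustration of labels kept in a file separate from the data itself, the sketch below reads a small labels file and looks up the label for one image. The file names, labels, and the CSV layout are hypothetical; real data sets use many different label formats.

```python
import csv
import io

# Hypothetical labels file: one row per image, stored apart from the image files.
LABELS_CSV = """filename,label
street_001.jpg,pedestrian
street_002.jpg,stop sign
street_003.jpg,cyclist
"""


def load_labels(csv_text):
    """Return a dict that maps each data file name to its text label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["filename"]: row["label"] for row in reader}


labels = load_labels(LABELS_CSV)
print(labels["street_002.jpg"])  # stop sign
```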

Data Set

A data set is a collection of information that relates to a similar subject. An example of a large data set for computer vision training is ImageNet.

Data Preprocessing

Data preprocessing turns raw data into higher-quality data before training so that machine learning algorithms produce better models and output. The steps include (1) data cleaning, which fixes errors in the data, eliminates duplicates, and addresses outliers and bias; (2) data integration, which merges data from multiple sources in different formats; (3) data transformation, which puts the data into the right format for processing; and (4) data reduction, which trims the dataset to a workable size. Careful preprocessing is critical to avoid unfairness or inaccuracies in the model caused by bias in the data.
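The four steps can be sketched on a toy example. Everything here is an assumption made for illustration: the records, the field names, the choice of min-max scaling for the transformation step, and keeping every second row for the reduction step. Real projects usually rely on dedicated libraries and far more careful rules.

```python
# Hypothetical raw records from two sources with different field names and units.
source_a = [
    {"id": 1, "speed_kmh": 50.0},
    {"id": 2, "speed_kmh": 80.0},
    {"id": 2, "speed_kmh": 80.0},  # duplicate row
    {"id": 3, "speed_kmh": None},  # missing value
]
source_b = [{"ID": 4, "speed_mph": 62.14}]


def clean(records):
    """Step 1: drop rows with missing values and exact duplicates."""
    seen, result = set(), []
    for record in records:
        key = (record["id"], record["speed_kmh"])
        if record["speed_kmh"] is None or key in seen:
            continue
        seen.add(key)
        result.append(record)
    return result


def integrate(records_a, records_b):
    """Step 2: merge both sources into one format (lowercase keys, km/h)."""
    converted = [
        {"id": r["ID"], "speed_kmh": round(r["speed_mph"] * 1.609344, 1)}
        for r in records_b
    ]
    return records_a + converted


def transform(records):
    """Step 3: rescale speed to the range 0..1 (assumes speeds are not all equal)."""
    speeds = [r["speed_kmh"] for r in records]
    low, high = min(speeds), max(speeds)
    return [
        {"id": r["id"], "speed": (r["speed_kmh"] - low) / (high - low)}
        for r in records
    ]


def reduce_size(records, keep_every=2):
    """Step 4: shrink the dataset by keeping every n-th row."""
    return records[::keep_every]


cleaned = clean(source_a)
merged = integrate(cleaned, source_b)
scaled = transform(merged)
reduced = reduce_size(scaled)
print([r["id"] for r in reduced])  # [1, 4]
```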

Data Processing

Data processing is the stage in which machine learning techniques and algorithms are applied to a large volume of data.

Postprocessing

Postprocessing is the method for modifying the output of the model after training to make it more accurate. It may include testing and retraining with new data.

Data—Synthetic

Synthetic data is information generated by a computer to mimic information obtained in traditional ways. It is used to train AI models and to improve their accuracy or reduce bias. It is especially useful when the real-world information is sensitive (like faces of people) or hard to obtain (like driving scenarios).
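One very simple way to mimic real measurements is to fit a statistical distribution to them and draw new values from it. The sketch below does that with a normal distribution. The "real" speed values are made up, and the choice of a normal distribution is an assumption for illustration; production systems typically use simulators or generative models instead.

```python
import random
import statistics

# Hypothetical "real" measurements, for example vehicle speeds in km/h.
real_speeds = [48.0, 52.5, 50.1, 47.3, 55.0, 49.8, 51.2, 53.4]


def make_synthetic(real_values, count, seed=0):
    """Draw new values from a normal distribution fitted to the real data."""
    mean = statistics.mean(real_values)
    stdev = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mean, stdev) for _ in range(count)]


synthetic_speeds = make_synthetic(real_speeds, count=1000)
print(len(synthetic_speeds))  # 1000
```

The synthetic values share the average and spread of the originals without copying any single real measurement.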

Data—Testing

In a machine learning process, testing data is a smaller subset of the main dataset that the model has not seen during training. It is used at the end of the training process to check whether the model is working accurately.

The EU AI Act defines testing data as “data used for providing an independent evaluation of the AI system in order to confirm the expected performance of that system before its placing on the market or putting into service.”

Data—Training

Training data in a machine learning process is the larger subset of the main dataset that is used to train the model to detect and learn meaningful patterns. This data provides the examples the model uses for learning.

The EU AI Act defines training data as “data used for training an AI system through fitting its learnable parameters.”

Data—Validation

The EU AI Act defines ‘validation data’ as “data used for providing an evaluation of the trained AI system and for tuning its non-learnable parameters and its learning process, among other things, in order to prevent underfitting or overfitting; whereas the validation dataset is a separate dataset or part of the training dataset, either as a fixed or variable split.”
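The relationship between training, validation, and testing data can be sketched as a split of one dataset into three non-overlapping parts. The 70/15/15 proportions and the function name are assumptions chosen for illustration; the EU AI Act definitions do not prescribe any particular split.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle, then cut the data into training, validation, and testing parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Shuffling before the cut keeps each part representative of the whole, and keeping the parts disjoint is what makes the testing data "unseen."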

Inference

Inference is when the AI model puts what it has learned into practice—like generating new text to answer a question or identifying lane markings on a road. The AI model generalizes from the knowledge it has gained through training (encoded in weights connecting the neurons in its structure) to make predictions about new unseen inputs.
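A minimal sketch of inference, assuming a tiny model whose weights were already learned: the weights stay fixed, and the model simply scores a new input. The feature names, weight values, and decision rule are all hypothetical; a real lane-detection model has millions of weights.

```python
# Hypothetical weights, standing in for values learned during training.
WEIGHTS = {"brightness": 2.0, "straightness": 3.0}
BIAS = -2.5


def predict_is_lane_marking(features):
    """Inference: score a new, unseen input using fixed weights."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return score > 0


new_input = {"brightness": 0.9, "straightness": 0.8}
print(predict_is_lane_marking(new_input))  # True
```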

Information

Information may mean facts or data in context.

Input

Input for an AI system may be data sets for training or a prompt from a user to generate output.

The EU AI Act defines ‘input data’ as “data provided to or directly acquired by an AI system on the basis of which the system produces an output.”

Output

Output is generated from the complex computational processes of an AI system. Output may include predictions, recommendations, decisions, or content. Output may be in the form of sounds, images, or text, like computer code and clunky college essays. Output may be considered computer-generated material.

Machine Learning Pipeline

A machine learning pipeline is a workflow process for building, evaluating, and deploying machine learning models in an automated and standardized way.
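The idea of an automated, standardized workflow can be sketched as a list of named steps that always run in the same order, each passing its result to the next. The step functions below are hypothetical placeholders (the "model" is just an average); real pipelines are typically built with dedicated tooling.

```python
def run_pipeline(data, steps):
    """Run each named step in order, passing the result on to the next step."""
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)
    return data, log


def preprocess(values):
    """Placeholder cleaning step: drop missing values."""
    return [v for v in values if v is not None]


def train(values):
    """Placeholder training step: the 'model' is simply the average value."""
    return {"mean": sum(values) / len(values)}


def evaluate(model):
    """Placeholder evaluation step: attach a pass/fail flag to the model."""
    return {**model, "passed": model["mean"] > 0}


steps = [("preprocess", preprocess), ("train", train), ("evaluate", evaluate)]
result, log = run_pipeline([1.0, None, 3.0], steps)
print(result)  # {'mean': 2.0, 'passed': True}
```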

Software (preprocessing, processing, and optimization)

In the AI context, there are numerous open-source software tools and libraries for developing, training, testing, and optimizing neural network models, including PyTorch, TensorFlow, and OpenVINO.

Training

Training is the process of feeding the AI system a large amount of data so that the system learns on its own. The system follows the rules of its algorithm to analyze the data, make inferences, and create output. In general, the more high-quality data the system consumes, the better it becomes at producing its output.
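Training can be sketched with the smallest possible model: a single learnable weight w in y = w * x. The sketch uses gradient descent, one common training algorithm, chosen here as an assumption; the examples, learning rate, and number of passes are made up for illustration.

```python
def train_weight(examples, learning_rate=0.01, epochs=200):
    """Fit w in y = w * x by gradient descent on the squared error."""
    w = 0.0  # the model starts out knowing nothing
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y
            # Nudge w against the gradient of 0.5 * error**2 with respect to w.
            w -= learning_rate * error * x
    return w


# Hypothetical training examples that follow the pattern y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned_w = train_weight(examples)
print(round(learned_w, 3))  # 2.0
```

The system is never told the rule "multiply by two." It repeatedly adjusts its weight to shrink its errors on the examples until the pattern emerges.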

RESOURCES

Biometric Information Privacy Act (BIPA) (740 ILCS 14/1) Definitions https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&ChapterID=57

Council of the European Union, “Proposal for a Regulation of the European Parliament and of the Council laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts,” Interinstitutional File: 2021/0106(COD) Brussels, 26 January 2024, https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf

IBM, “What is a machine learning pipeline?” https://www.ibm.com/topics/machine-learning-pipeline

Intel, “OpenVINO™ toolkit: An open-source AI toolkit that makes it easier to write once, deploy anywhere.” https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html

Google, “Create production-grade machine learning models with TensorFlow.” https://www.tensorflow.org/

The Linux Foundation, “A rich ecosystem of tools and libraries extends PyTorch and supports development in computer vision, NLP and more.” https://pytorch.org/

Merriam Webster Dictionary, Information, https://www.merriam-webster.com/dictionary/information

Toon, Nigel, How AI Thinks (Transworld Publishers, 2024), p. 117 (distinguishing between facts, information, and knowledge)

This AI Law Maze Map blog is for education only. It is not intended as legal advice.

By using this website and information, you acknowledge and agree that no attorney-client relationship is created or implied.
