Data preparation as part of the ML workflow
Data preparation accounts for an average of 43% of the effort in the ML workflow and is probably its most resource-intensive activity. In comparison, model selection and creation account for only 17%. Data preparation is the part of the data pipeline that ingests raw data and outputs data in a form that can be used both for training an ML model and for prediction by a trained ML model.
Data preparation may include the following activities:
- Identification: The types of data to be used for training and predictions are identified. For example, for a self-driving car, this could include identifying the need for radar, video, and laser (LiDAR) data.
- Collecting: The data source is identified and the means to collect the data is determined. For example, this could include identifying the International Monetary Fund (IMF) as the source of financial data and the channels through which the data will be fed into the AI-based system.
The data collected may be in various forms (e.g., numeric, categorical, image, tabular, text, time series, sensor, geospatial, video, and audio).
Pre-processing of data
- Cleansing: When erroneous data, duplicate data, or outliers are detected, they are either removed or corrected. In addition, missing data values can be replaced with estimated or guessed values (e.g., mean, median, and mode). Removal or anonymization of personal data can also be performed.
- Transformation: The format of the given data is changed (e.g., decomposition of an address stored as a string into its constituent parts, omission of a field with a random identifier, conversion of categorical data into numerical data, change of image formats). Some of the transformations applied to numeric data involve scaling so that all values use the same range.
For example, standardization rescales the data so that it has a mean of zero and a standard deviation of one, whereas normalization rescales the data so that it lies in the range between zero and one.
- Augmentation: This is used to increase the number of samples in a data set. Augmentation can also be used to include adversarial examples in the training data to increase robustness against adversarial attacks (see 9.1).
- Sampling: This involves selecting a portion of the entire available dataset so that patterns can be observed in the larger dataset. This is usually done to reduce the cost and time required to build the ML model.
Note that any preprocessing runs the risk of altering useful valid data or adding invalid data.
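The cleansing and scaling steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended implementation; the function names and the sample readings are hypothetical. It shows median imputation of a missing value, followed by standardization (mean 0, standard deviation 1) and normalization (range 0 to 1) of the same values.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to mean 0 and population standard deviation 1.

    Assumes the values are not all identical (otherwise sigma is 0).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Min-max rescale to the range [0, 1].

    Assumes the values are not all identical (otherwise hi == lo).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical readings with one missing value.
raw = [2.0, 4.0, None, 6.0, 8.0]
clean = impute_median(raw)          # the gap is filled with the median, 5.0
standardized = standardize(clean)   # centered on 0, unit standard deviation
normalized = normalize(clean)       # smallest value maps to 0, largest to 1
```

Note that the imputed value is an estimate, which illustrates the risk mentioned above: pre-processing can add data that was never observed.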
- Feature selection: A feature is an attribute or property that is reflected in the data. Feature selection involves choosing those features that are most likely to contribute to model training and prediction. In practice, this often means removing features that are not expected (or desired) to impact the resulting model. By removing irrelevant information (noise), feature selection can reduce overall training time, help prevent overfitting (see Section 3.5.1), increase accuracy, and make models more generalizable.
- Feature extraction: This involves deriving informative and non-redundant features from the existing features. The resulting dataset is usually smaller and can be used to build an ML model with the same accuracy more cheaply and quickly.
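As one simple illustration of feature selection, the sketch below drops features whose values never vary, since a constant column cannot help distinguish one sample from another. The variance filter is an assumption chosen for brevity, not a method prescribed by the text; the function name, feature names, and data are hypothetical. Real projects often use other criteria, such as correlation with the target.

```python
from statistics import pvariance

def select_by_variance(rows, names, threshold=0.0):
    """Keep only the features whose variance across all rows exceeds threshold."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    selected_names = [names[i] for i in keep]
    selected_rows = [[row[i] for i in keep] for row in rows]
    return selected_names, selected_rows

# Hypothetical housing data: country_code is identical in every row.
names = ["area_m2", "rooms", "country_code"]
rows = [
    [50.0, 2, 1],
    [80.0, 3, 1],
    [120.0, 5, 1],
]
kept_names, kept_rows = select_by_variance(rows, names)
# kept_names now contains only "area_m2" and "rooms".
```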
In parallel with these data preparation activities, exploratory data analysis (EDA) is also typically performed to support the overall data preparation task. This includes performing data analysis to discover trends in the data and using data visualization to present data in a visual format by plotting trends in the data.
Although the above data preparation activities and sub-activities have been presented in a logical order, different projects may arrange them differently or use only a subset of them. Some of the data preparation steps, such as data source identification, are performed only once and may be done manually. Other steps may be part of the operational data pipeline and typically work with live data. These steps should be automated.
Data preparation challenges
Some of the challenges associated with data preparation are:
- The need for knowledge of:
- The application domain.
- The data and its characteristics.
- The various techniques associated with data preparation.
- The difficulty of obtaining high quality data from different sources.
- The difficulty of automating the data pipeline and ensuring that the production data pipeline is both scalable and has adequate performance efficiency (e.g., the time required to process a data item).
- The costs associated with data preparation.
- Insufficient priority is given to checking for errors introduced into the data pipeline during data preparation.
- The introduction of sampling bias (see Section 2.4).
Training, validation and test data sets in the ML workflow
Logically, three sets of equivalent data (i.e., randomly selected from a single initial data set) are required to develop an ML model:
- A training dataset, which is used to train the model.
- A validation dataset, which is used to evaluate and then tune the model.
- A test data set (also called a holdout data set) that is used to test the tuned model.
When unlimited suitable data is available, the amount of data used in the ML workflow for training, evaluation, and testing usually depends on the following factors:
- The algorithm used to train the model.
- The availability of resources such as RAM, disk space, processing power, network bandwidth, and available time.
In practice, because it is difficult to obtain sufficient suitable data, the training and validation data sets are often derived from a single combined data set. The test dataset is kept separately and not used in training. This is to ensure that the developed model is not affected by the test data and that the test results accurately reflect the quality of the model.
There is no optimal ratio for splitting the combined data set into the three individual data sets, but typical ratios that can be used as a guide range from 60:20:20 to 80:10:10 (training: validation: test). Splitting the data into these data sets is often done randomly unless the data set is small or there is a risk that the resulting data sets will not be representative of the expected operational data.
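A random split along these lines might look like the following sketch. The function name, the default 60:20:20 ratio, and the fixed seed are illustrative assumptions; the seed only serves to make the split reproducible.

```python
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split items into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible shuffle
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder becomes the test set
    return train, val, test

# Hypothetical dataset of 100 sample identifiers.
train, val, test = split_dataset(range(100))
```

A purely random split like this is not appropriate when the data set is small or when a random draw might not be representative of the expected operational data, as noted above.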
When limited data are available, splitting the available data into three data sets may result in insufficient data for effective training. To address this issue, the training and validation data sets can be combined (keeping the test data set separate) and then used to create multiple split combinations of this data set (e.g., 80% training / 20% validation). The data is then randomly assigned to the training and validation data sets. Training, validation, and tuning are performed using these multiple split combinations to create multiple tuned models, and the overall model performance can be calculated as an average over all runs. There are several methods for creating multiple split combinations, including split-test, bootstrap, K-fold cross-validation, and leave-one-out cross-validation.
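K-fold cross-validation, one of the methods listed above, can be sketched as follows. This is a simplified illustration that assumes the test data set has already been set aside. The names `k_fold_splits`, `cross_validate`, and the callback `train_and_score` are hypothetical; the callback stands in for whatever training and evaluation procedure a project uses and is expected to return a numeric score.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield k (training, validation) pairs.

    Each item appears in the validation set of exactly one pair.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

def cross_validate(items, train_and_score, k=5):
    """Average the score returned by train_and_score over all k splits."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

# With 10 items and k=5, every split has 8 training and 2 validation items.
splits = list(k_fold_splits(range(10), k=5))
```

Leave-one-out cross-validation corresponds to the special case where `k` equals the number of items.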
Data Quality Issues
Typical quality issues related to the data in a dataset include those listed in the following table:
|Data quality issue|Description|
|---|---|
|Wrong data|The captured data was incorrect (e.g., due to a defective sensor) or was entered incorrectly (e.g., copy-and-paste error).|
|Incomplete data|Data values may be missing (e.g., a field in a record may be empty, or the data may have been omitted for a certain time interval). There can be several reasons for incomplete data, including security issues, hardware issues, and human error.|
|Mislabeled data|There are several possible reasons for mislabeling data (see Section 4.5.2).|
|Insufficient data|There is not enough data for the learning algorithms used to detect patterns (note that the minimum amount of data required varies for different algorithms).|
|Data not pre-processed|Data should be pre-processed to ensure that it is clean, has a consistent format, and does not contain unwanted outliers (see Section 4.1).|
|Obsolete data|The data used for learning and prediction should be as current as possible (e.g., using financial data that dates back several years may well produce inaccurate results).|
|Unbalanced data|Unbalanced data can result from inappropriate biases (e.g., based on race, gender, or ethnicity), inappropriate placement of sensors (e.g., facial recognition cameras mounted at ceiling height), variations in the availability of datasets, and varying motivations of data providers.|
|Unfair data|Fairness is a subjective quality characteristic, but can often be determined. For example, to promote diversity or gender balance, selected data may be positively biased toward minorities or disadvantaged groups (note that such data may be considered fair, but not balanced).|
|Duplicate data|Repeated data records can overly influence the resulting ML model.|
|Irrelevant data|Data that is not relevant to the problem being addressed may influence the results and lead to a waste of resources.|
|Privacy issues|Any use of data should comply with relevant data protection laws (e.g., GDPR in relation to personal data of individuals in the European Union).|
|Security issues|Fraudulent or misleading data intentionally inserted into the training data can lead to inaccuracy of the trained model.|
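Some of the issues in the table can be detected automatically. The sketch below counts records with missing values, exact duplicate records, and the number of records per label as a rough indicator of imbalance. It is only an illustration: the function name, the `label` field, and the sample records are hypothetical, and issues such as wrong, obsolete, or unfair data generally need domain knowledge rather than a simple count.

```python
from collections import Counter

def quality_report(records, label_key="label"):
    """Count a few basic data quality issues in a list of dict records.

    Assumes all field values are hashable (numbers, strings, None).
    """
    missing = sum(1 for r in records if any(v is None for v in r.values()))
    # Represent each record as a sorted tuple of items so it can be compared.
    as_tuples = [tuple(sorted(r.items())) for r in records]
    duplicates = len(as_tuples) - len(set(as_tuples))
    label_counts = Counter(
        r[label_key] for r in records if r.get(label_key) is not None
    )
    return {
        "records_with_missing_values": missing,
        "duplicate_records": duplicates,
        "label_counts": dict(label_counts),
    }

# Hypothetical sensor records.
records = [
    {"temp": 21.5, "label": "ok"},
    {"temp": 21.5, "label": "ok"},     # exact duplicate of the first record
    {"temp": None, "label": "ok"},     # missing temperature value
    {"temp": 35.0, "label": "fault"},
]
report = quality_report(records)
```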
Data quality and its impact on the ML model
The quality of the ML model is highly dependent on the quality of the data set from which it is created.
Poor quality data can lead to both flawed models and flawed predictions.
The following categories of errors result from problems with data quality:
- Decreased accuracy: These deficiencies are caused by incorrect, incomplete, mislabeled, insufficient, outdated, or irrelevant data, as well as data that has not been preprocessed. For example, if the data is used to create a model of expected home prices, but the training data contains little or no data on single-family homes with sunrooms, then the predicted prices for that particular type of home would likely be inaccurate.
- Biased model: These errors are caused by incomplete, unbalanced, unfair, inconsistent, or duplicate data. For example, if the data for a particular characteristic is missing (e.g., if all medical data used to predict disease comes from subjects of a particular sex), this is likely to adversely affect the resulting model (unless the model is to be used only to make predictions for that sex).
- Impaired model: These shortcomings are due to privacy and security constraints. For example, privacy issues in the data could lead to security vulnerabilities that would allow attackers to recover information from the models, which in turn could lead to the disclosure of personal information.
Data labeling for supervised learning
Data labeling is the process of enriching unlabeled (or poorly labeled) data by adding labels so that it becomes suitable for use in supervised learning. Data labeling is a resource-intensive activity that reportedly consumes an average of 25% of the time in ML projects.
In its simplest form, data labeling can consist of placing images or text files in different folders based on their classes. For example, all text files with positive product reviews are placed in one folder and all negative reviews are placed in another folder. Labeling objects in images by drawing rectangles around them is another common labeling technique often referred to as annotation. More complex annotation may be required for labeling 3D objects or drawing bounding boxes around irregular objects. Data labeling and annotation are usually supported by tools.
Approaches to data labeling
Labeling can be done in a number of ways:
- Internally: Labeling is performed by developers, testers, or a team established for labeling within the organization.
- Outsourced: Labeling is performed by an external specialized organization.
- Crowdsourced: Labeling is performed by a large group of people. Since it is difficult to control the quality of the labeling, multiple annotators may be asked to label the same data, and a decision is then made as to which labeling to use.
- AI-assisted: AI-based tools are used to recognize and annotate data or to cluster similar data. The results are then confirmed or possibly adjusted (e.g., by changing the bounding box) by a human as part of a two-step process.
- Hybrid: A combination of the above labeling approaches could be used. For example, crowd labeling is typically managed by an external organization that has access to specialized AI-based crowd management tools.
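For the crowdsourced approach, one common way to decide which label to use when several annotators label the same item is a majority vote. The sketch below is a minimal illustration under that assumption; the function name, the agreement threshold, and the sample annotations are hypothetical, and other resolution schemes (such as weighting annotators by past accuracy) are equally possible.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.5):
    """Return the most common label, or None if agreement is too low.

    A label is accepted only if more than min_agreement of the
    annotators chose it.
    """
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    if count / len(votes) > min_agreement:
        return label
    return None  # no clear majority, e.g. escalate to an expert reviewer

# Hypothetical annotations from several annotators per image.
annotations = {
    "img_001": ["cat", "cat", "dog"],  # two of three agree
    "img_002": ["cat", "dog"],         # tie, no majority
}
resolved = {item: majority_label(votes) for item, votes in annotations.items()}
```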
Where appropriate, it may be possible to reuse a dataset that has already been labeled, eliminating the need for data labeling altogether. Many such datasets are publicly available, for example on Kaggle.
Mislabeled data in datasets
In supervised learning, it is assumed that data has been correctly labeled by data annotators. However, in practice, it is rarely the case that all elements in a dataset are correctly labeled. Data may be mislabeled for reasons including the following:
- Random mistakes can be made by annotators (e.g., pressing the wrong key).
- Systematic errors can be made (e.g., annotators receive incorrect instructions or are poorly trained).
- Intentional errors can be made by malicious data annotators.
- Translation errors can cause correctly labeled data in one language to be mislabeled in another language.
- If the choice of label is open to interpretation, subjective judgments by data annotators can lead to different annotators assigning conflicting labels to the same data.
- Lack of expertise can lead to incorrect labeling.
- Complex classification tasks can lead to a higher error rate.
- The tools used to support data labeling may have shortcomings that lead to incorrect labels.
- ML-based approaches to labeling are probabilistic, which can lead to some incorrect labeling.