Training, testing and validation datasets
The division of the input data into training, testing and validation sets is crucial to the creation of robust machine learning algorithms. Machine learning algorithms first require a training set to be trained on. With each iteration, the algorithm calculates the difference between the predicted and actual outcomes and refines its weights accordingly to reduce this difference. The algorithm produced is therefore tailored specifically to the training data set. To assess the generalisability of the final algorithm and its learnt parameters, it is tested on a separate testing data set.
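To make this iterative refinement concrete, the sketch below fits a simple linear model by gradient descent; the toy data, model and learning rate are illustrative assumptions rather than the method of any particular algorithm described here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                          # toy training features
true_w = np.array([1.5, -2.0, 0.5])               # weights used to generate the toy data
y = X @ true_w + 0.1 * rng.standard_normal(100)   # toy actual outcomes

w = np.zeros(3)   # the algorithm's weights, initially zero
lr = 0.1          # learning rate (a hyperparameter)

for iteration in range(2000):
    y_pred = X @ w                    # predicted outcomes
    error = y_pred - y                # difference between predicted and actual outcomes
    gradient = X.T @ error / len(y)   # direction that most reduces this difference
    w -= lr * gradient                # refine the weights accordingly

print(w)  # converges towards true_w: the model is tailored to this training set
```

Because the weights are fitted to these particular examples, accuracy on them says little about accuracy on new data, which is why a separate testing set is needed.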
The main issue that can arise is overfitting of the algorithm to a specific training set. When this occurs, the algorithm will be accurate (low error rate between predicted and actual results) on the training data set, but highly inaccurate on the testing data set. To overcome this, a third, separate validation set is used: the algorithm is taught with the training set, and the hyperparameters (e.g. the architecture, number of iterations and allowable error) are optimized based on its accuracy on the validation data set. Once a final model has been created using the training and validation sets, it is applied to the testing set for a final unbiased evaluation.
The three data sets should be randomly divided at the commencement of the project, with the ratio dependent on the specific project and the total data size. A common starting point is a ratio of 60:20:20 for the training, validation and testing sets, respectively.
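As an illustration, the following is a minimal sketch of a 60:20:20 random split, followed by hyperparameter optimization on the validation set and a single final evaluation on the testing set. It uses scikit-learn's train_test_split, make_classification and KNeighborsClassifier purely for demonstration; the synthetic data, random seed and candidate values of k are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real (e.g. imaging) dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 60:20:20 split: hold out 40% of the data, then halve the held-out
# portion into the validation and testing sets.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

# Optimize a hyperparameter (here, the number of neighbours k)
# against accuracy on the validation set only.
best_k, best_val_acc = None, 0.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Final unbiased evaluation: the testing set is consulted exactly once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final_model.score(X_test, y_test))
```

Fixing the random seed makes the split reproducible, and because the testing set plays no part in choosing k, the reported test accuracy remains an unbiased estimate of generalisability.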
Related Radiopaedia articles
Artificial intelligence
- artificial intelligence (AI)
- imaging data sets
- computer-aided diagnosis (CAD)
- natural language processing
- machine learning (overview)
- visualizing and understanding neural networks
- common data preparation/preprocessing steps
- DICOM to bitmap conversion
- dimensionality reduction
- scaling
- centering
- normalization
- principal component analysis
- training, testing and validation datasets
- augmentation
- loss function
- optimization algorithms
- ADAM
- momentum (Nesterov)
- stochastic gradient descent
- mini-batch gradient descent
- regularisation
- linear and quadratic
- batch normalization
- ensembling
- rule-based expert systems
- glossary
- activation function
- anomaly detection
- automation bias
- backpropagation
- batch size
- computer vision
- concept drift
- cost function
- confusion matrix
- convolution
- cross validation
- curse of dimensionality
- dice similarity coefficient
- dimensionality reduction
- epoch
- explainable artificial intelligence/XAI
- feature extraction
- federated learning
- gradient descent
- ground truth
- hyperparameters
- image registration
- imputation
- iteration
- jaccard index
- linear algebra
- noise reduction
- normalization
- R (programming language)
- Python (programming language)
- segmentation
- semi-supervised learning
- synthetic and augmented data
- overfitting
- transfer learning