What is Cross Validation?
Cross-validation is a technique used to evaluate the performance of a machine learning model by partitioning the data into multiple subsets.
It involves training the model on some of these subsets and testing it on the remaining data, rotating the subsets to ensure every part of the data is used for both training and testing.
This approach helps in assessing how well the model generalizes to unseen data and reduces the risk of overfitting, especially when working with limited datasets.
By using most of the dataset for both training and validation, cross-validation provides a more reliable estimate of the model’s performance on real-world data.
Types of Cross Validation
There are different types of cross validation methods, and they can be classified into two broad categories – Non-exhaustive and Exhaustive Methods. We’re going to look at a few examples from both categories.
Non-exhaustive Methods
Non-exhaustive cross validation methods, as the name suggests, do not compute all ways of splitting the original data. Let us go through these methods to get a clearer understanding.
Holdout method
The most basic cross validation technique divides the entire dataset into two parts: training data and testing data.
As the name suggests, we train the model on the training data and then evaluate it on the testing set. Usually, the training set is at least twice the size of the testing set, so the data is typically split in a 70:30 or 80:20 ratio.
In this approach, the data is first shuffled randomly before splitting. Because the model is trained on a different combination of data points each time, it can give different results every time we train it, which causes instability. Moreover, we cannot be sure that the training set we picked is representative of the entire dataset.
Also, when our dataset is not very large, the testing data may contain critical information that the model never learns from, since we do not train on the testing set. The holdout method is appropriate when you have a large dataset, are short on time, or are building an initial baseline model in your data science project.
For example, suppose you must build a machine learning model that detects brain tumors from MRI images. You’ll first get your hands on a dataset of MRI images, each labeled as containing a brain tumor or not.
Then, using the holdout method, you’ll divide the dataset into two portions: 70% of the images go into the training set, and the remaining 30% go into the testing set. This way, you’ll train the model on 70% of the images and then test its accuracy on the remaining 30% of labeled images.
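To make the split concrete, here is a minimal holdout sketch using scikit-learn, with synthetic data standing in for the MRI dataset (the `make_classification` call and the logistic regression model are illustrative placeholders, not part of the original example):

```python
# Minimal holdout sketch: 70:30 split, train on one part, score on the other.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in data; replace with your real feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The data is shuffled by default before the 70:30 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```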
K fold cross validation
K-fold cross validation is a technique that improves on the holdout method. It reduces the dependence of the model’s score on how the train and test sets happen to be picked.
The dataset is divided into k subsets (folds), and each fold serves as the testing set exactly once over k iterations. Let us go through this in steps:
- Randomly split your entire dataset into k folds (subsets).
- For each fold, train your model on the remaining k-1 folds, then test it on the held-out fold to measure its performance.
- Repeat the previous step until every fold has served as the test set exactly once during the model-building process.
- The average of the accuracies across all k folds is called the cross-validation accuracy, and it serves as the model’s performance metric.
This method is mainly preferred when we have a small dataset because it is less biased: every data point in the dataset appears in the testing set exactly once and in the training set k-1 times.
For example, suppose we have a small dataset of dermoscopic images representing different types of skin cancers. If we use the holdout method, some types may be detected easily while others are missed, because their essential examples end up in the testing set and are never used for training. Using k folds here helps a lot, as each image is tested against a model trained on the remaining k-1 folds, so every data point is used for both training and testing.
The only disadvantage of this method is that it is computationally expensive, as it runs the training algorithm k times and evaluates the model on each fold.
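As an illustration, here is a short k-fold sketch using scikit-learn’s `KFold` and `cross_val_score` on synthetic stand-in data (the random forest model is just a placeholder choice):

```python
# k-fold sketch: 5 folds, each serving as the test set exactly once.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kf)

print("Per-fold accuracy:", scores)
print("Cross-validation accuracy (mean):", scores.mean())
```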
Stratified K Fold Cross Validation
If we implement K folds on a classification dataset, we’ll randomly shuffle the data and divide it into k folds.
However, there are chances that some folds contain data specific to one particular class, making the folds imbalanced and our training biased.
For example, suppose we get a fold in which the majority of points belong to one class (say positive) and only a few belong to the negative class. This will undoubtedly skew our training, and to avoid it, we make stratified folds using stratification.
Stratification is the process of rearranging the data to ensure that each fold is representative of the whole dataset. For example, in a binary classification problem, every data point belongs to either class A or class B, so we arrange the data such that each fold contains roughly the same proportion of each class as the full dataset.
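A brief sketch, assuming scikit-learn’s `StratifiedKFold` and a deliberately imbalanced synthetic dataset, shows how each test fold keeps roughly the original class ratio:

```python
# Stratified k-fold sketch: every fold preserves the (imbalanced) class ratio.
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification

# Roughly 90% / 10% class imbalance to make stratification visible.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    pos_ratio = y[test_idx].mean()   # share of the positive class in this test fold
    print(f"Fold {i}: positive-class share in the test fold = {pos_ratio:.2f}")
```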
Exhaustive Methods
Exhaustive cross validation techniques test all possible ways of dividing the original sample into a training and a validation set, holding out one or more data points for testing in every possible combination.
Leave-P-Out cross validation
When using this exhaustive method, we take p points out of the total number of data points in the dataset (say n). We train the model on the remaining (n – p) data points and test it on the p held-out points.
We repeat this process for every possible combination of p points from the original dataset, and then average the accuracies from all these iterations to get the final score.
This is an exhaustive method, as we train the model on every possible combination of data points. Remember that the number of iterations equals the number of ways of choosing p points out of n (n choose p), so a higher value of p makes the method far more expensive.
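As a rough illustration, the sketch below uses scikit-learn’s `LeavePOut` with p = 2 on a small subsample of the built-in Iris dataset; the subsampling and the k-nearest-neighbours model are only there to keep the number of combinations manageable:

```python
# Leave-P-Out sketch with p = 2 on a tiny dataset, since the number of
# splits grows as "n choose p".
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]            # subsample to 30 points: C(30, 2) = 435 splits already

lpo = LeavePOut(p=2)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=lpo)

print("Number of splits:", lpo.get_n_splits(X))
print("Mean accuracy over all splits:", scores.mean())
```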
Leave-one-out cross validation
This is a special case of Leave-P-Out cross validation with p set to 1. We leave one data point out for testing and train the model on all the other data points. After testing on the left-out point, we repeat the process until every data point has served as the test set exactly once.
It is an exhaustive method, and the accuracy can vary noticeably between iterations if the left-out data point happens to be an outlier.
For example, suppose we have a dataset of dermoscopic images representing skin cancer types. We can leave one image out, train the model on the remaining images, and then validate it on the left-out image. Repeating this over the whole dataset means the model is trained and tested on every data point.
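Here is a minimal leave-one-out sketch, assuming scikit-learn’s `LeaveOneOut` and the built-in Wine dataset as a stand-in for the dermoscopic images:

```python
# Leave-one-out sketch: every sample is the test set exactly once,
# so the model is refit as many times as there are samples.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)   # 178 samples, so LOOCV means 178 fits

loo = LeaveOneOut()
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=loo)

print("Number of fits:", len(scores))   # one per sample
print("LOOCV accuracy:", scores.mean())
```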
What is Rolling Cross Validation?
For time-series data, the above-mentioned methods are not the best ways to evaluate models. Here are two reasons why:
- Shuffling the data destroys the temporal structure, as it disrupts the order of events.
- With ordinary cross-validation, there is a chance that we train the model on future data and test on past data, which breaks the golden rule of time series: “peeking into the future is not allowed”.
Keeping these points in mind, we perform cross validation in the following manner:
- We create the fold (or subsets) in a forward-chaining fashion.
- Suppose we have a time series of stock prices for a period of n years, and we divide the data yearly into n folds. The folds would be created like this:
  - iteration 1: training [1], test [2]
  - iteration 2: training [1 2], test [3]
  - iteration 3: training [1 2 3], test [4]
  - iteration 4: training [1 2 3 4], test [5]
  - iteration 5: training [1 2 3 4 5], test [6]
  - …
  - iteration n-1: training [1 2 3 … n-1], test [n]
As we can see, in the first iteration we train on the data of the first year and test on the second year. Similarly, in the next iteration we train on the data of the first and second years and test on the third year of data.
Note: It is not necessary to divide the data into years; this example was simply chosen to make the idea easier to understand.
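For reference, scikit-learn implements this forward-chaining scheme as `TimeSeriesSplit`; the minimal sketch below simply prints the growing training windows on a toy, time-ordered array (the 60-point series is an illustrative stand-in, not real stock data):

```python
# Rolling (forward-chaining) cross validation sketch: training windows only
# ever grow forward in time, so we never test on the past.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)   # 60 time-ordered observations (e.g. monthly prices)

tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"iteration {i}: train indices {train_idx[0]}..{train_idx[-1]}, "
          f"test indices {test_idx[0]}..{test_idx[-1]}")
```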
FAQs
1. What is the purpose of cross validation?
The purpose of cross-validation is to test the ability of a machine learning model to predict new data. It is also used to flag problems like overfitting or selection bias, and it gives insight into how the model will generalize to an independent dataset.
2. How do you explain cross validation?
Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models. It is used to protect against overfitting in a predictive model, particularly in cases where the amount of data is limited. In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the error estimates across the folds.
3. What are the types of cross validation?
Common types of cross validation in machine learning include:
- Holdout Method
- K-Fold Cross-Validation
- Stratified K-Fold Cross-Validation
- Leave-P-Out Cross-Validation
4. What is cross validation and why we need it?
Cross-validation is a very useful technique for assessing the effectiveness of a machine learning model, particularly in cases where you need to mitigate overfitting. It also helps in tuning the hyperparameters of your model, since it indicates which parameter values result in the lowest test error.
5. Does cross validation reduce Overfitting?
Cross-validation helps you detect and guard against overfitting by estimating the model’s skill on new data. There are also common tactics you can use to select an appropriate value of k for your dataset.
This brings us to the end of this article, where we learned about cross validation and some of its variants. To get in-depth experience and knowledge of machine learning, take the free course from the Great Learning Academy.