When you’re building and training machine learning models, it is always a good idea to compare several variations of your model and pick the one with the best configuration. Doing this properly requires separate data for training and validation, which can be a problem if your dataset is small: you may not have enough data points and may be tempted to use the same data for both training and validation. A very helpful statistical method for tackling this problem is cross-validation.
Cross-validation is a method for resampling your training and validation datasets. There are different ways to do cross-validation; in this short article we will look at how to do it using a specific method called K-fold, which we focus on because it is the most commonly used resampling method.
Are you dealing with datasets where the classes do not have the same number of samples? Click here to learn about 6 Ways to Handle Class Imbalance.
2 Reasons Why Cross-Validation Is a Good Idea
Here are 2 major reasons why you may want to use Cross-Validation (CV) in your methodology:
- Using CV will ensure the efficient use of your data
When you build a model, you would normally keep your training dataset separate from your validation dataset. When your dataset is small, you may be tempted to use the same data to both train and test your model, which yields overly optimistic scores because the model is evaluated on data it has already seen. Cross-validation lets you make full use of a small dataset without falling into this trap.
- Better Out-of-sample performance
Generally, cross-validation, done correctly, gives you a more reliable estimate of how your model will perform on unseen data, which helps you select configurations that genuinely generalize better.
The K-Fold Cross-Validation
Now that we have defined cross-validation, let us take a look at the specific type called K-fold. As stated earlier, we will consider K-fold because it is the most commonly used resampling method. The idea is to break your dataset into k different parts and, in each iteration, use one group for validation and the others for training.
A Step-by-Step Guide to K-Fold Validation
1. Determine the value of k (explained in detail in a later section).
2. Randomize your dataset.
3. Split your dataset into k different groups (commonly called folds) of equal size.
4. Select the first group.
5. Set the chosen group to be the validation dataset.
6. Train a new model on the other groups.
7. Test this model on the current validation dataset and record the score.
8. Discard the model.
9. Move on to the next group.
10. Go back to step 5 and repeat until every group has been used for validation.
The first time you train, group 1 is used for validation; the second time, group 2; and so on up to the k-th group. Also, ensure that you discard each model after training and validation. This means that, over the whole procedure, each data point ends up being used for validation exactly once.
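The steps above can be sketched in plain Python. The k_fold_indices helper below is an illustrative, hypothetical implementation (not scikit-learn's), shown only to make the procedure concrete:

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split indices 0..n_samples-1 into k shuffled folds (illustrative sketch)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                # step 2: randomize
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for j, idx in enumerate(indices[k * fold_size:]):   # spread any leftover samples
        folds[j].append(idx)
    return folds

folds = k_fold_indices(10, k=5)                         # step 3: k equal groups
for i, validation in enumerate(folds):                  # steps 4-5 and 9
    train = [idx for f in folds if f is not validation for idx in f]
    # steps 6-8: train a fresh model on `train`, score it on `validation`,
    # record the score, then discard the model
    print("fold", i, "validation indices:", sorted(validation))
```

Because the folds partition the shuffled indices, every data point appears in exactly one validation fold.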
How to Determine the Value of K
The first step in the process is determining the value of k. Choosing this value well helps you build models with low bias. Typically, k is set to 5 or 10; in scikit-learn, for example, the default value is 5, which gives you 5 groups.
Another method for choosing k is to set it to n, where n is the data size. This is called Leave-One-Out, described in the next section.
Ultimately, the correct value of k is dependent on your dataset and the problem you’re trying to solve. A rule of thumb is to pick a value of k that ensures your train dataset is similar in distribution to the original dataset.
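As a quick illustration of how the choice of k controls the train/validation split sizes (assuming scikit-learn is installed), compare a few values of k on a toy dataset of 100 points:

```python
from sklearn.model_selection import KFold

n_samples = 100
data = list(range(n_samples))        # stand-in for a real dataset
for k in (5, 10, n_samples):         # k = n_samples is Leave-One-Out
    kf = KFold(n_splits=k)
    train_idx, val_idx = next(iter(kf.split(data)))
    print(f"k={k}: train size={len(train_idx)}, validation size={len(val_idx)}")
```

With k=5 each model trains on 80% of the data; with k=n it trains on all but a single point, at the cost of fitting n models.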
Variations of K-Fold
There are many variations of K-Fold, three of which are:
- Group K-Fold: This ensures that the same group is not represented in both testing and training sets.
- Stratified K-Fold: This preserves the percentage of samples of each class.
- Leave One Out: Here, we set the value of k to n, so each validation fold contains exactly one data point.
You can check the scikit learn documentation for more variations.
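To make two of these variations concrete, here is a small sketch using scikit-learn's StratifiedKFold and LeaveOneOut on a deliberately imbalanced toy dataset (the data is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, LeaveOneOut

X = np.zeros((12, 1))             # dummy features
y = np.array([0] * 8 + [1] * 4)   # imbalanced labels: 8 of class 0, 4 of class 1

skf = StratifiedKFold(n_splits=4)
for train_idx, test_idx in skf.split(X, y):
    # every test fold preserves the 2:1 class ratio (two 0s, one 1)
    print("test labels:", y[test_idx])

loo = LeaveOneOut()
print("Leave-One-Out folds:", loo.get_n_splits(X))   # one fold per sample
```

Stratification matters most when classes are imbalanced: plain KFold could easily produce a fold with no minority-class samples at all.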
Beyond Learning, Start Practicing
Scikit-learn has cross-validation utilities to help you, and the process is very simple. The following snippet is a simple example of how K-fold splitting works in scikit-learn:
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("TRAIN: %s TEST: %s" % (train, test))
This produces 2 splits. Note that kf.split returns index arrays rather than the data points themselves, and by default it does not shuffle:
TRAIN: [5 6 7 8 9] TEST: [0 1 2 3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6 7 8 9]
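In practice you rarely write the loop yourself: scikit-learn's cross_val_score runs the whole K-fold procedure, training and discarding one model per fold. A minimal sketch on the built-in iris dataset (the choice of model here is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 runs 5-fold cross-validation and returns one accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("mean accuracy:", scores.mean())
```

Averaging the fold scores gives the out-of-sample performance estimate discussed earlier.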
You can also check out the official scikit-learn documentation for more information. Next time you’re training your models, you will be able to use this method to pick the variation with the best configuration.
Have you tried the K-fold method before? How was your experience with it? Do you have more questions about it? Please send us a direct message here.