Notes on Cross-Validation
Cross-validation is a statistical method for evaluating and comparing hypothesis functions by dividing the data into two segments: one segment is used to induce the hypothesis function, and the other segment is used to test it [1]. Various cross-validation techniques exist, including hold-out cross-validation, $k$-fold cross-validation and leave-one-out cross-validation.
To achieve a robust estimate of the empirical loss associated with a given hypothesis function $h$, one can use $k$-fold cross-validation [2]. The empirical loss [3] is a measure of how well the hypothesis function $h$ approximates the true function $f$, calculated as

$$\hat{L}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(x_i), y_i\big),$$

where $N$ is the number of instances in the evaluation set and the loss function $\ell$ measures the extent to which the predicted class variable value $h(x_i)$ differs from the actual class variable value $y_i$. Many different loss functions exist [4], such as the squared-error loss function (whose average over the data is the mean-squared-error),

$$\ell\big(h(x_i), y_i\big) = \big(h(x_i) - y_i\big)^2,$$

or the cross-entropy loss function,

$$\ell\big(h(x_i), y_i\big) = -\big(y_i \ln h(x_i) + (1 - y_i) \ln\big(1 - h(x_i)\big)\big).$$

In this dissertation, the 0-1 loss function is used to calculate the empirical loss associated with each hypothesis function presented,

$$\ell\big(h(x_i), y_i\big) = \begin{cases} 0 & \text{if } h(x_i) = y_i, \\ 1 & \text{otherwise.} \end{cases}$$
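As an illustration, the short Python/NumPy sketch below (not part of the original dissertation; the labels and predictions are made up) computes the per-instance losses for each of the three loss functions above and averages them into an empirical loss.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # 0-1 loss: 1 for a misclassified instance, 0 otherwise.
    return (y_pred != y_true).astype(float)

def squared_error_loss(y_pred, y_true):
    # Squared-error loss between predicted and actual values.
    return (y_pred - y_true) ** 2

def cross_entropy_loss(p_pred, y_true, eps=1e-12):
    # Binary cross-entropy, with p_pred the predicted probability of class 1.
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def empirical_loss(per_instance_losses):
    # Empirical loss: the average of the per-instance losses.
    return float(np.mean(per_instance_losses))

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])            # hard class predictions
p_pred = np.array([0.9, 0.2, 0.4, 0.8, 0.1])  # predicted P(class = 1)

print(empirical_loss(zero_one_loss(y_pred, y_true)))       # 0.2
print(empirical_loss(squared_error_loss(y_pred, y_true)))  # 0.2
print(empirical_loss(cross_entropy_loss(p_pred, y_true)))
```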
Using cross-validation provides valuable insight into the stability of the empirical loss associated with a given hypothesis function. The empirical loss may vary due to the choice of the testing set, the choice of the training set, the internal randomness of the induction algorithm, and the random classification error caused by mislabelled objects in the testing data [5].
In model selection, $k$-fold cross-validation is employed to estimate the empirical loss associated with a given hypothesis function, as summarized in the algorithm below. Several classifiers are trained, each with a different hyper-parameter configuration $\theta$. The hyper-parameter configuration $\theta^*$ that minimizes the empirical loss $\hat{L}(h_\theta)$ is selected. During cross-validation the data set $D$ is split into $k$ subsets $D_1, \dots, D_k$, each with an approximately equal number of example instances. The set $T_i = D \setminus D_i$, with $i = 1, \dots, k$, is divided equally into a training subset $T_i^{\text{train}}$ and a validation subset $T_i^{\text{val}}$. The subset $T_i^{\text{train}}$ is used for training the classifier, the subset $T_i^{\text{val}}$ is used to calculate the empirical loss of the induced classifier $h_\theta$, while the subset $D_i$ is not used. The hypothesis is trained and validated $k$ times, each time producing an empirical loss estimate. The individual empirical loss estimates are averaged to calculate the reported empirical loss $\hat{L}(h_\theta)$ for classifier $h_\theta$.
[IN] Data set $D$
[OUT] Best hyper-parameter configuration $\theta^*$
[BEGIN ALGORITHM]
[DO] Randomly divide data set $D$ into $k$ partitions $D_1, \dots, D_k$, each having an approximately equal number of instances
[REPEAT]
[DO] Select a hyper-parameter configuration $\theta$ [Note: By manual, grid or random search]
[DO] Initialize $\hat{L}(h_\theta)$ to $0$
[FOR] $i = 1$ [TO] $k$
[DO] Use temporary data set $T_i = D \setminus D_i$.
[DO] Partition temporary data set $T_i$ into a training set $T_i^{\text{train}}$ and a validation set $T_i^{\text{val}}$.
[DO] Induce hypothesis $h_\theta$ over training set $T_i^{\text{train}}$.
[DO] Estimate empirical loss $\hat{L}_i(h_\theta)$ over validation set $T_i^{\text{val}}$.
[DO] $\hat{L}(h_\theta) \leftarrow \hat{L}(h_\theta) + \hat{L}_i(h_\theta)$.
[END FOR]
[DO] $\hat{L}(h_\theta) \leftarrow \hat{L}(h_\theta) / k$
[DO] Tabulate $\big(\theta, \hat{L}(h_\theta)\big)$
[UNTIL] Hyper-parameter optimization is complete
[DO] Select the hyper-parameter configuration with the lowest $\hat{L}(h_\theta)$ [Note: This is $\theta^*$]
[END ALGORITHM]
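A minimal Python sketch of this model-selection loop follows. It assumes scikit-learn-style estimators; the synthetic data, the k-nearest-neighbour classifier and the candidate n_neighbors values are illustrative choices rather than part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data and hyper-parameter grid (assumptions, not from the original text).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
candidate_configs = [{"n_neighbors": n} for n in (1, 3, 5, 7)]
k = 5

kfold = KFold(n_splits=k, shuffle=True, random_state=0)
results = []

for config in candidate_configs:                    # REPEAT over configurations theta
    fold_losses = []
    for keep_idx, held_out_idx in kfold.split(X):   # FOR i = 1 TO k
        # Temporary data set T_i = D \ D_i; the held-out partition D_i is not used here.
        X_tmp, y_tmp = X[keep_idx], y[keep_idx]
        # Partition T_i equally into a training subset and a validation subset.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_tmp, y_tmp, test_size=0.5, random_state=0)
        # Induce the hypothesis over the training subset.
        h = KNeighborsClassifier(**config).fit(X_tr, y_tr)
        # Estimate the empirical 0-1 loss over the validation subset.
        fold_losses.append(np.mean(h.predict(X_val) != y_val))
    results.append((config, np.mean(fold_losses)))  # tabulate (theta, L-hat)

best_config, best_loss = min(results, key=lambda r: r[1])
print(best_config, best_loss)
```

As in the algorithm, the held-out partition of each fold is left untouched during model selection; only the remaining instances are split (here equally) into training and validation subsets.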
After finding the best hyper-parameter configuration $\theta^*$ (the one with the lowest associated empirical loss), the empirical loss for hypothesis $h_{\theta^*}$ is estimated using the set $D \setminus D_i$ as training set and the set $D_i$ as testing set, as outlined in the algorithm below. The hypothesis is trained and tested $k$ times, each time producing an empirical loss estimate. The individual empirical loss estimates are averaged to calculate the reported empirical loss for classifier $h_{\theta^*}$.
[IN] Data set $D$; best hyper-parameter configuration $\theta^*$
[OUT] Empirical loss estimate $\hat{L}(h_{\theta^*})$ for hypothesis $h_{\theta^*}$ with hyper-parameter configuration $\theta^*$
[BEGIN ALGORITHM]
[DO] Randomly divide data set $D$ into $k$ partitions $D_1, \dots, D_k$, each having an approximately equal number of instances
[DO] Initialize $\hat{L}(h_{\theta^*})$ to $0$
[FOR] $i = 1$ [TO] $k$
[DO] Use data set $D \setminus D_i$ as training set and data set $D_i$ as testing set.
[DO] Induce hypothesis $h_{\theta^*}$ over training set $D \setminus D_i$.
[DO] Estimate empirical loss $\hat{L}_i(h_{\theta^*})$ over testing set $D_i$.
[DO] $\hat{L}(h_{\theta^*}) \leftarrow \hat{L}(h_{\theta^*}) + \hat{L}_i(h_{\theta^*})$.
[END FOR]
[DO] $\hat{L}(h_{\theta^*}) \leftarrow \hat{L}(h_{\theta^*}) / k$
[END ALGORITHM]
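A corresponding sketch of the evaluation procedure, again assuming scikit-learn and carrying over a hypothetical selected configuration from the previous example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data and an assumed selected configuration, for example only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
best_config = {"n_neighbors": 5}
k = 5

kfold = KFold(n_splits=k, shuffle=True, random_state=0)
fold_losses = []
for train_idx, test_idx in kfold.split(X):
    # Train on D \ D_i and test on the held-out partition D_i.
    h = KNeighborsClassifier(**best_config).fit(X[train_idx], y[train_idx])
    # Empirical 0-1 loss on the testing partition for this fold.
    fold_losses.append(np.mean(h.predict(X[test_idx]) != y[test_idx]))

# Reported empirical loss: the average over the k folds.
print(np.mean(fold_losses))
```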
In this way, model selection is done independently from model evaluation: the partition $D_i$ is not used during model selection, but only to estimate the performance of the classifier after a hyper-parameter configuration has been selected. The phenomenon of peeking, where the same data set is used both to select and to evaluate a hypothesis, is thereby avoided [6].
Notes
- Originally published as part of Wilgenbus, E.F., 2013. The file fragment classification problem: a combined neural network and linear programming discriminant model approach. Master's thesis, North-West University.
Footnotes
1. Refaeilzadeh, P., Tang, L., Liu, H., 2009. Cross-validation. In: Encyclopedia of Database Systems. Springer, United States.
2. Bouckaert, R. R., 2003. Choosing between two learning algorithms based on calibrated tests. In: Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, United States of America, pp. 51–58.
3. See the post on supervised learning.
4. Bishop, C. M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
5. Kuncheva, L., 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley and Sons.
6. Russell, S., Norvig, P., 2010. Artificial Intelligence: A Modern Approach, 3rd Edition. Pearson.