Notes on Cross-Validation
Cross-validation is a statistical method of evaluating and comparing hypothesis functions by dividing data into two segments: one segment is used to induce the hypothesis function, and the other is used to test it [1]. Various cross-validation techniques exist, including hold-out cross-validation, $k$-fold cross-validation and leave-one-out cross-validation.
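The simplest of these, the hold-out split, can be sketched as follows. This is a minimal illustrative sketch, not taken from the dissertation; the function name `holdout_split` and the 30% test fraction are assumptions chosen for the example.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Divide a dataset into two segments: one to induce the
    hypothesis function (training) and one to test it (testing).

    Note: `holdout_split`, `test_fraction`, and `seed` are
    illustrative names, not part of the original text."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)  # randomize so both segments are representative
    n_test = int(len(data) * test_fraction)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test
```

Shuffling before splitting avoids bias when the data is ordered, e.g. grouped by class.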
To achieve a robust estimate of the empirical loss associated with a given hypothesis function, one can use $k$-fold cross-validation [2]. The empirical loss [3] is a measure of how well the hypothesis function $h$ approximates the true function $f$, calculated as

$$\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big),$$
where the loss function $\ell$ measures the extent to which the predicted class variable value $h(x_i)$ differs from the actual class variable value $y_i$. Many different loss functions exist [4], such as the mean-squared-error loss function,

$$\ell\big(h(x_i), y_i\big) = \big(h(x_i) - y_i\big)^2,$$
or the cross-entropy loss function,

$$\ell\big(h(x_i), y_i\big) = -\Big[\, y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big].$$
In this dissertation, the 0-1 loss function is used to calculate the empirical loss associated with each hypothesis function presented,

$$\ell\big(h(x_i), y_i\big) = \begin{cases} 0 & \text{if } h(x_i) = y_i, \\ 1 & \text{otherwise.} \end{cases}$$
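The loss functions above and the averaged empirical loss can be expressed compactly in code. This is an illustrative sketch: the function names and the binary form of the cross-entropy loss are assumptions made for the example.

```python
import math

def mse_loss(y_pred, y_true):
    # mean-squared-error loss for a single prediction
    return (y_pred - y_true) ** 2

def cross_entropy_loss(p_pred, y_true, eps=1e-12):
    # binary cross-entropy: y_true in {0, 1}, p_pred a probability;
    # eps clips p_pred away from 0 and 1 to avoid log(0)
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def zero_one_loss(y_pred, y_true):
    # 0-1 loss: 0 if the predicted class equals the actual class, else 1
    return 0 if y_pred == y_true else 1

def empirical_loss(h, examples, loss=zero_one_loss):
    # average loss of hypothesis h over a list of (x, y) examples
    return sum(loss(h(x), y) for x, y in examples) / len(examples)
```

Under the 0-1 loss, the empirical loss is simply the misclassification rate of $h$ on the given examples.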
Using cross-validation provides valuable insight into the stability of the empirical loss associated with a given hypothesis function. Empirical loss may vary due to the choice of the testing set, the choice of the training set, the internal randomness of the induction algorithm, as well as the random classification error due to mislabelled objects in the testing data [5].
In model selection, $k$-fold cross-validation is employed to estimate the empirical loss associated with a given hypothesis function, as summarized in Algorithm alg:modelselection. Several classifiers are trained, each with a different hyper-parameter configuration $\theta_j$. The best hyper-parameter configuration $\theta^*$ is selected so as to minimize the estimated empirical loss $\hat{L}(h_{\theta})$. During cross-validation the dataset $D$ is split into $k$ subsets $D_1, \dots, D_k$, each with an approximately equal number of example instances. The set $D$ with $D_i$ removed is divided equally into a training subset $T_i$ and a validation subset $V_i$. The subset $T_i$ is used for training the classifier, the subset $V_i$ is used to calculate the empirical loss of the induced classifier $h_i$, while subset $D_i$ is not used. The hypothesis is trained and validated $k$ times, each time producing an empirical loss estimate. The $k$ individual empirical loss estimates are averaged to calculate the reported empirical loss for the classifier.
Algorithm alg:modelselection
Input: data set $D$
Output: best hyper-parameter configuration $\theta^*$
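The model-selection procedure can be sketched in Python. This is a minimal sketch of standard $k$-fold selection under the 0-1 loss, in which each held-out fold serves directly as the validation set; the interface `train_fn(config, train_set)`, which is assumed to return a classifier `h` such that `h(x)` predicts a class label, is a hypothetical one introduced for illustration.

```python
import random

def k_fold_model_selection(data, configs, train_fn, k=5, seed=0):
    """Select the hyper-parameter configuration whose classifier
    attains the lowest empirical loss, averaged over k folds.

    `train_fn(config, train_set) -> h` is an assumed interface,
    not part of the original text."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    # split the index list into k approximately equal folds
    folds = [indices[i::k] for i in range(k)]

    def empirical_loss(h, examples):
        # 0-1 loss averaged over the examples (misclassification rate)
        return sum(h(x) != y for x, y in examples) / len(examples)

    best_config, best_loss = None, float("inf")
    for config in configs:
        fold_losses = []
        for i in range(k):
            val = [data[j] for j in folds[i]]
            train = [data[j] for j in indices if j not in folds[i]]
            h = train_fn(config, train)          # induce the classifier
            fold_losses.append(empirical_loss(h, val))
        avg_loss = sum(fold_losses) / k          # average over the k folds
        if avg_loss < best_loss:
            best_config, best_loss = config, avg_loss
    return best_config, best_loss
```

For example, with a threshold classifier `train_fn = lambda c, tr: (lambda x: x >= c)` and labelled data generated by a true threshold, the procedure recovers the true threshold from the candidate configurations.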