Notes on Cross-Validation

Cross-validation is a statistical method of evaluating and comparing hypothesis functions by dividing data into two segments: one segment is used to induce the hypothesis function and the other is used to test it [1]. Various cross-validation techniques exist, including hold-out cross-validation, $k$-fold cross-validation and leave-one-out cross-validation.

To achieve a robust estimate of the empirical loss associated with a given hypothesis function, one can use $k$-fold cross-validation [2]. The empirical loss [3] is a measure of how well the hypothesis function $h$ approximates the true function $f$, calculated as

$$R_{emp}(h) = \frac{1}{N} \sum_{t=1}^N L\left(y^{(t)}, h(\vec{x}^{(t)})\right)$$

where the loss function $L(y, h(\vec{x}))$ measures the extent to which the predicted class variable value $\hat{y} = h(\vec{x})$ differs from the actual class variable value $y$. Many different loss functions exist [4], such as the squared-error loss function,

$$L(y, \hat{y}=h(\vec{x})) = (y - \hat{y})^2$$

or the cross-entropy loss function,

$$L(y, \hat{y}=h(\vec{x})) = y\ln\left(\frac{\hat{y}}{y}\right) + (1-y)\ln\left(\frac{1-\hat{y}}{1-y}\right).$$

In this dissertation, the 0-1 loss function is used to calculate the empirical loss associated with each hypothesis function presented,

$$L(y, \hat{y}=h(\vec{x})) = \begin{cases} 0 & \text{if } h(\vec{x}) = y \\ 1 & \text{if } h(\vec{x}) \ne y \end{cases}$$
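These quantities can be computed directly. Below is a minimal Python sketch of the squared-error and 0-1 losses together with the empirical loss average; the function names are illustrative, and the hypothesis $h$ is assumed to be a callable that maps a feature vector to a predicted class value.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """Squared-error loss (y - y_hat)^2 for a single example."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0-1 loss: 0 if the prediction equals the label, 1 otherwise."""
    return 0.0 if y_hat == y else 1.0

def empirical_loss(h, X, y, loss=zero_one_loss):
    """Average the per-example loss of hypothesis h over the N examples in (X, y)."""
    return np.mean([loss(y_t, h(x_t)) for x_t, y_t in zip(X, y)])
```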

Using cross-validation provides valuable insight into the stability of the empirical loss associated with a given hypothesis function. The empirical loss may vary due to the choice of the testing set, the choice of the training set, the internal randomness of the induction algorithm, and random classification error caused by mislabelled objects in the testing data [5].

In model selection, $k$-fold cross-validation is employed to estimate the empirical loss associated with a given hypothesis function, as summarized in Algorithm 1 below. Several classifiers are trained, each with a different hyper-parameter configuration $\vec{\lambda}$. The best hyper-parameter configuration $\vec{\lambda}^*$ is selected so as to minimize the empirical loss $R_{emp}(h^{\vec{\lambda}})$. During cross-validation the dataset $D$ is split into $k$ subsets $S_1, S_2, \dots, S_k$, each with an approximately equal number of example instances. Each set $T_i = D - S_i$ with $i = 1, \dots, k$ is divided equally into a training subset $T_{i1}$ and a validation subset $T_{i2}$. The subset $T_{i1}$ is used for training the classifier, the subset $T_{i2}$ is used to calculate the empirical loss of the induced classifier $h_i^{\vec{\lambda}}$, while the subset $S_i$ is not used. The hypothesis $h^{\vec{\lambda}}$ is trained and validated $k$ times, each time producing an empirical loss estimate. The $k$ individual empirical loss estimates are averaged to obtain the reported empirical loss for classifier $h^{\vec{\lambda}}$.

[IN] Data set $D$
[OUT] Best hyper-parameter configuration $\vec{\lambda}^*$
[BEGIN ALGORITHM]
[DO] Randomly divide data set $D$ into $k$ partitions $S_1, \dots, S_k$, each having an approximately equal number of instances
[REPEAT]

[DO] Select a hyper-parameter configuration $\vec{\lambda}$ [7]
[DO] Initialize $\overline{R}_{emp}(h^{\vec{\lambda}})$ to $0$
[FOR] $i=1$ [TO] $k$

[DO] Use temporary data set $T_i = D - S_i$.
[DO] Partition temporary data set $T_i$ into a training set $T_{i1}$ and a validation set $T_{i2}$.
[DO] Induce hypothesis $h_i^{\vec{\lambda}}$ over training set $T_{i1}$.
[DO] Estimate empirical loss $R_{emp}(h_i^{\vec{\lambda}})$ over validation set $T_{i2}$.
[DO] $\overline{R}_{emp}(h^{\vec{\lambda}}) = \overline{R}_{emp}(h^{\vec{\lambda}}) + R_{emp}(h_i^{\vec{\lambda}})$.

[END FOR]
[DO] $\overline{R}_{emp}(h^{\vec{\lambda}}) = \overline{R}_{emp}(h^{\vec{\lambda}}) / k$
[DO] Tabulate $\overline{R}_{emp}(h^{\vec{\lambda}})$

[UNTIL] Hyper-parameter optimization is complete
[DO] Select the hyper-parameter configuration $\vec{\lambda}$ with the lowest $\overline{R}_{emp}(h^{\vec{\lambda}})$ [Note: This is $\vec{\lambda}^*$]
[END ALGORITHM]

Algorithm 1: $k$-fold cross-validation for model selection
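A minimal Python sketch of Algorithm 1 follows, assuming a scikit-learn style estimator and the 0-1 loss; the function name `select_hyper_parameters`, the parameter grid and the fold count are illustrative and not part of the original notation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold, train_test_split

def select_hyper_parameters(base_estimator, param_grid, X, y, k=10, seed=0):
    """Sketch of Algorithm 1: return the configuration with the lowest
    average validation loss (0-1 loss) over k folds."""
    folds = list(KFold(n_splits=k, shuffle=True, random_state=seed).split(X))
    best_params, best_loss = None, np.inf
    for params in param_grid:            # configurations from manual, grid or random search
        fold_losses = []
        for t_idx, s_idx in folds:       # t_idx indexes T_i = D - S_i; S_i is not used here
            # Divide T_i equally into a training subset T_i1 and a validation subset T_i2
            tr_idx, val_idx = train_test_split(t_idx, test_size=0.5, random_state=seed)
            h_i = clone(base_estimator).set_params(**params).fit(X[tr_idx], y[tr_idx])
            fold_losses.append(np.mean(h_i.predict(X[val_idx]) != y[val_idx]))  # 0-1 loss on T_i2
        mean_loss = np.mean(fold_losses)  # averaged empirical loss for this configuration
        if mean_loss < best_loss:
            best_params, best_loss = params, mean_loss
    return best_params, best_loss
```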

After finding the best hyper-parameter configuration $\vec{\lambda}^*$ (the one with the lowest associated empirical loss), the empirical loss for hypothesis $h^{\vec{\lambda}}$ with $\vec{\lambda} = \vec{\lambda}^*$ is estimated using the set $T_i$ as training set and the set $S_i$ as testing set, as outlined in Algorithm 2 below. The hypothesis $h^{\vec{\lambda}}$ is trained and tested $k$ times, each time producing an empirical loss estimate. The $k$ individual empirical loss estimates are averaged to obtain the reported empirical loss for classifier $h^{\vec{\lambda}}$.

[IN] Data set $D$, hyper-parameter configuration $\vec{\lambda} = \vec{\lambda}^*$
[OUT] Empirical loss estimate $\overline{R}_{emp}(h^{\vec{\lambda}})$
[BEGIN ALGORITHM]
[DO] Randomly divide data set $D$ into $k$ partitions $S_1, \dots, S_k$, each having an approximately equal number of instances
[DO] Initialize $\overline{R}_{emp}(h^{\vec{\lambda}})$ to $0$
[FOR] $i=1$ [TO] $k$

[DO] Use data set $T_i = D - S_i$ as training set and data set $S_i$ as testing set.
[DO] Induce hypothesis $h_i^{\vec{\lambda}}$ over training set $T_i$.
[DO] Estimate empirical loss $R_{emp}(h_i^{\vec{\lambda}})$ over testing set $S_i$.
[DO] $\overline{R}_{emp}(h^{\vec{\lambda}}) = \overline{R}_{emp}(h^{\vec{\lambda}}) + R_{emp}(h_i^{\vec{\lambda}})$.

[END FOR]
[DO] $\overline{R}_{emp}(h^{\vec{\lambda}}) = \overline{R}_{emp}(h^{\vec{\lambda}}) / k$
[END ALGORITHM]

Algorithm 2: $k$-fold cross-validation for model evaluation
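A matching sketch of Algorithm 2, under the same assumptions (scikit-learn style estimator, 0-1 loss, illustrative names):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def evaluate_hyper_parameters(base_estimator, best_params, X, y, k=10, seed=0):
    """Sketch of Algorithm 2: estimate the empirical loss of the selected
    configuration, training on T_i = D - S_i and testing on the held-out S_i."""
    fold_losses = []
    for t_idx, s_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        h_i = clone(base_estimator).set_params(**best_params).fit(X[t_idx], y[t_idx])
        fold_losses.append(np.mean(h_i.predict(X[s_idx]) != y[s_idx]))  # 0-1 loss on S_i
    return np.mean(fold_losses)  # averaged empirical loss over the k folds
```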

In this way, model selection is done independently of model evaluation, because $S_i$ is not used during model selection but only to estimate the performance of the classifier after a hyper-parameter configuration has been selected. The phenomenon of peaking, where the same data set is used for both selecting and evaluating a hypothesis, is therefore avoided [6].
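Putting the two sketches together on toy data (the classifier, hyper-parameter grid and data set below are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data for illustration
param_grid = [{"max_depth": d} for d in (2, 4, 8)]          # illustrative hyper-parameter grid

best_params, _ = select_hyper_parameters(DecisionTreeClassifier(), param_grid, X, y)
reported_loss = evaluate_hyper_parameters(DecisionTreeClassifier(), best_params, X, y)
print(best_params, reported_loss)
```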


Footnotes

  1. Refaeilzadeh, P., Tang, L., Liu, H., 2009. Cross-validation. In: Encyclopedia of Database Systems. Springer, United States.

  2. Bouckaert, R. R., 2003. Choosing between two learning algorithms based on calibrated tests. In: Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, United States of America, pp. 51–58.

  3. See post on supervised learning

  4. Bishop, C. M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.

  5. Kuncheva, L., 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley and Sons.

  6. Russell, S., Norvig, P., 2010. Artificial Intelligence: A Modern Approach, 3rd Edition. Pearson.

  7. By manual, grid or random search.

Related tags

emerging technology, decision and data science