Notes on Model Evaluation and Selection

A model evaluation measure is a quantitative statement of how well a particular classifier meets the goals of the supervised learning endeavour[^1]. In this research project the performance of each hypothesised classifier is reported by means of several evaluation measures: the confusion matrix, classification accuracy, precision and recall. Each reported evaluation measure is estimated using 10-fold cross-validation[^2]. That is, an evaluation measure is calculated as the average of 10 individual evaluation measures estimated on independently trained and tested classifiers. The evaluation measures used in this dissertation are outlined in the next three subsections.
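As an illustration of this estimation procedure, the snippet below is a minimal sketch of how 10-fold cross-validated accuracy, precision and recall might be obtained with scikit-learn; the dataset and classifier are placeholders, not the ones used in this project.

```python
# A minimal sketch of estimating evaluation measures with 10-fold
# cross-validation. The dataset and classifier below are illustrative
# placeholders, not the ones used in this project.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)              # stand-in data set
clf = DecisionTreeClassifier(random_state=0)   # stand-in hypothesised classifier

# Each measure is estimated on 10 independently trained and tested folds
# and then averaged.
scores = cross_validate(
    clf, X, y, cv=10,
    scoring=["accuracy", "precision_macro", "recall_macro"],
)
for name in ["accuracy", "precision_macro", "recall_macro"]:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (averaged over {len(fold_scores)} folds)")
```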

Confusion matrix

A confusion matrix is a matrix of size $n \times n$, where $n$ is the number of classes considered in the classification problem, that shows the relation between the predicted class labels and the actual class labels[^3]. The confusion matrix is particularly useful for determining the extent of misclassification in individual classes. A classifier may perform extremely well in correctly classifying instances belonging to one class, yet extremely poorly in correctly classifying instances belonging to another class. The value of a classifier that identifies some classes well and other classes poorly depends on the application domain.

A confusion matrix for a three-class classification problem is shown in the table below. The number of instances actually belonging to class 1 that were also predicted as belonging to class 1 is denoted $a$; the total number of instances in the testing set actually belonging to class 1 is denoted $t_{r1} = a + b + c$; while the total number of instances in the testing set predicted to belong to class 1 is denoted $t_{c1} = a + d + g$.

| Actual \ Predicted | Class 1 | Class 2 | Class 3 | Total | Recall |
| --- | --- | --- | --- | --- | --- |
| Class 1 | $a$ | $b$ | $c$ | $t_{r1}$ | $\rho_1$ |
| Class 2 | $d$ | $e$ | $f$ | $t_{r2}$ | $\rho_2$ |
| Class 3 | $g$ | $h$ | $i$ | $t_{r3}$ | $\rho_3$ |
| Total | $t_{c1}$ | $t_{c2}$ | $t_{c3}$ | | |
| Precision | $\pi_1$ | $\pi_2$ | $\pi_3$ | | |

Model evaluation criteria: example of a confusion matrix
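For illustration, the sketch below builds such a confusion matrix from actual and predicted labels; the labels are made up, and `build_confusion_matrix` is only a hypothetical stand-in for library routines such as scikit-learn's `confusion_matrix`.

```python
import numpy as np

def build_confusion_matrix(y_true, y_pred, n_classes):
    """Confusion matrix with rows as actual classes and columns as
    predicted classes, matching the layout of the table above."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Made-up labels for a three-class problem (classes coded 0, 1, 2).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 2]

cm = build_confusion_matrix(y_true, y_pred, n_classes=3)
row_totals = cm.sum(axis=1)   # t_r: instances actually in each class
col_totals = cm.sum(axis=0)   # t_c: instances predicted for each class
print(cm)
print("row totals:", row_totals, "column totals:", col_totals)
```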

Precision and recall

There are four possible classification outcomes for every instance in the data set[^4]. If a classifier assigns an instance to class $i$ and the instance does in actual fact belong to class $i$, it is counted as a true positive (${tp}_i$). If the classifier assigns the instance to class $i$ when it does not actually belong to class $i$, it is counted as a false positive (${fp}_i$). An instance not assigned to class $i$ that does not belong to class $i$ is counted as a true negative (${tn}_i$), whereas an instance that is not assigned to class $i$ but does belong to class $i$ is counted as a false negative (${fn}_i$).

For every class $i$ in the $n$-class classification problem a precision rate can be calculated as[^5],

$$\pi_i = \frac{{tp}_i}{{tp}_i + {fp}_i}$$

and a recall rate can be calculated as,

$$\rho_i = \frac{{tp}_i}{{tp}_i + {fn}_i}$$

The precision rate measures the probability that an instance classified as belonging to class $i$ is in fact a class $i$ instance. The precision rate for class 1 in the table can be calculated as $\pi_1 = a/t_{c1}$. The recall rate summarizes a classifier's ability to identify an instance as belonging to class $i$ if it does in fact belong to class $i$. The recall rate for class 1 in the table can be calculated as $\rho_1 = a/t_{r1}$. Similarly, precision and recall can be calculated for each class. The average precision and recall rates for a classifier are calculated as the averages over all the individual class precision and recall rates.
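The per-class rates can be read straight off the confusion matrix counts. The sketch below assumes the row-actual/column-predicted layout of the table above and uses an illustrative matrix of counts.

```python
import numpy as np

def precision_recall_per_class(cm):
    """Per-class precision and recall from a confusion matrix whose rows
    are actual classes and whose columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # pi_i  = tp_i / (tp_i + fp_i)  (column totals)
    recall    = tp / cm.sum(axis=1)   # rho_i = tp_i / (tp_i + fn_i)  (row totals)
    return precision, recall

# Illustrative three-class confusion matrix (rows actual, columns predicted).
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  5, 49]])

precision, recall = precision_recall_per_class(cm)
print("per-class precision:", np.round(precision, 3))
print("per-class recall:   ", np.round(recall, 3))
print("macro averages:", precision.mean().round(3), recall.mean().round(3))
```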

Model accuracy

Accuracy, sometimes called the correct classification rate, can be calculated as the number of correct classifications made by the model over the test set, divided by the total number of instances in the test set[^6]. From the confusion matrix in the table above, the classification accuracy of the classifier can be calculated as $(a + e + i)/(a + b + \dots + i)$.
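In code this is simply the trace of the confusion matrix divided by the sum of all its entries, as in the brief sketch below (again with an illustrative matrix of counts).

```python
import numpy as np

# Illustrative confusion matrix (rows actual, columns predicted).
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  5, 49]])

# (a + e + i) / (a + b + ... + i): diagonal sum over the total count.
accuracy = np.trace(cm) / cm.sum()
print(f"classification accuracy: {accuracy:.3f}")
```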

The classification accuracy is related to the empirical loss (with a 0-1 loss function) estimated in cross-validation[^2]. The empirical loss measures the rate of misclassification whereas the classification accuracy $\overline{A}$ measures the rate of correct classification, formulated as,

$$
\begin{aligned}
\overline{A} &= 1 - \overline{R}_{emp}(h) \\
&= 1 - \underset{S_1,S_2,\dots,S_{10}}{\operatorname{mean}} R_{emp}(h) \\
&= 1 - \underset{S_1,S_2,\dots,S_{10}}{\operatorname{mean}} \frac{1}{|S_i|} \sum_{(\vec{x},y) \in S_i} \begin{cases} 0 &\text{if } h(\vec{x}) = y \\ 1 &\text{if } h(\vec{x}) \ne y \end{cases}
\end{aligned}
$$
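A minimal sketch of this calculation, assuming scikit-learn's `KFold` for the subsets and a placeholder dataset and classifier, might look as follows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # placeholder data set
folds = KFold(n_splits=10, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in folds.split(X):
    h = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # Empirical 0-1 loss R_emp(h_i) on the held-out subset S_i.
    zero_one_loss = np.mean(h.predict(X[test_idx]) != y[test_idx])
    fold_accuracies.append(1.0 - zero_one_loss)   # A_i = 1 - R_emp(h_i)

A_bar = np.mean(fold_accuracies)                  # averaged over S_1, ..., S_10
print(f"cross-validated accuracy: {A_bar:.3f}")
```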

Model comparison

After evaluating each hypothesised classifier built as part of the knowledge discovery endeavour, the performance measures can be compared to select the best model. In comparing the performance of two classifiers, the best classifier can be selected based on the perceived difference in the observed evaluation measures. Alternatively, a more rigorous approach is to statistically test the significance of the observed performance difference between the two classifiers[^7]. A statistical hypothesis test tries to determine whether one can expect the same results from a repeated study or whether the observed results can be attributed to chance.

To compare the performance of two classifiers $h^a$ and $h^b$, a 10-fold cross-validation $t$-test is used to test the research hypothesis that the two classifiers perform equally well[^8]. That is, the null hypothesis states that the two classifiers perform equally well and that any observed difference is due to chance, whereas the alternative hypothesis states that their performances differ. If the two algorithms are statistically different then the null hypothesis is rejected in favour of the alternative hypothesis.

As stated already, the classification accuracy of an induced classifier is estimated using $k$-fold cross-validation. For each subset $S_i$ the accuracy measures $A^a_i = 1 - R_{emp}(h^a_i)$ and $A^b_i = 1 - R_{emp}(h^b_i)$ are estimated. The difference in the estimated accuracies of classifiers $h^a$ and $h^b$ on subset $S_i$ is $d_i = A^a_i - A^b_i$. The difference in the estimated classifier accuracies is assumed to be normally distributed with mean,

$$\overline{d} = \frac{1}{k}\sum_{i=1}^k d_i$$

and variance

$$\hat{\sigma}_d^2 = \frac{\sum_{i=1}^k (d_i - \overline{d})^2}{k-1}$$

Under the null hypothesis $H_0: d = 0$ the test statistic,

$$t_d = \sqrt{k}\,\frac{\overline{d}}{\sqrt{\hat{\sigma}_d^2}}$$

has a Student's $t$-distribution with $k-1$ degrees of freedom. From the Student's $t$-distribution the probability $p$ of obtaining a test statistic greater than or equal to the observed statistic, assuming the null hypothesis is true, can be obtained. If the two-tailed $p$-value is less than the chosen level of significance $\alpha$ then the null hypothesis $H_0$ can be rejected in favour of the two-tailed alternative hypothesis $H_A: d \ne 0$. That is, the null hypothesis is rejected if $p(|t| \ge t_d \mid H_0) \le \alpha$. The alternative hypothesis states that the accuracy of classifier $h^a$ and the accuracy of classifier $h^b$ are statistically different from one another. Instead of explicitly specifying a level of significance, usually chosen as $\alpha = 0.05$, the $p$-value can also be interpreted as the smallest significance level at which the null hypothesis would be rejected[^9].
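As a sketch of the full procedure, the snippet below computes $\overline{d}$, $\hat{\sigma}_d$, the test statistic and the two-tailed $p$-value for two sets of hypothetical per-fold accuracies; `scipy.stats.ttest_rel` performs the same paired test and is shown for comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two classifiers h^a and h^b,
# estimated on the same 10 cross-validation subsets S_1, ..., S_10.
A_a = np.array([0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.91, 0.90, 0.92, 0.89])
A_b = np.array([0.88, 0.87, 0.90, 0.89, 0.90, 0.86, 0.88, 0.89, 0.90, 0.87])

d = A_a - A_b                          # per-fold differences d_i
k = len(d)
d_bar = d.mean()                       # mean difference
sigma_hat = d.std(ddof=1)              # sample standard deviation of the d_i
t_d = np.sqrt(k) * d_bar / sigma_hat   # test statistic, k - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_d), df=k - 1)   # two-tailed p-value

print(f"t = {t_d:.3f}, p = {p_value:.4f}")

# scipy's paired t-test gives the same statistic and p-value.
t_scipy, p_scipy = stats.ttest_rel(A_a, A_b)
```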

As alternatives to the 10-fold cross-validation $t$-test, McNemar's test, the $5\times2$-fold cross-validation test and the $10\times10$-fold cross-validation test were also considered. McNemar's test is not suitable as it does not consider variability due to the choice of training set or due to the internal randomness of the induction algorithm [^book_kuncheva]. Neither the $10\times10$-fold cross-validation test nor the $5\times2$-fold cross-validation test has been widely adopted in the data mining field, with the 10-fold cross-validation test remaining the most popular procedure[^10].


Footnotes

[^1]: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17 (3), 37.

[^2]: See the post on cross-validation.

[^3]: Kohavi, R., Provost, F., 1998. Glossary of terms. Machine Learning 30, 271–274.

[^4]: Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874.

[^5]: Ozgur, A., Ozgur, L., Gungor, T., 2005. Text categorization with class-based and corpus-based keyword selection. In: International Symposium on Computer and Information Sciences. Istanbul, Turkey, pp. 606–615.

[^6]: Olson, D. L., Shi, Y., 2006. Introduction to Business Data Mining. Irwin-McGraw-Hill Series: Operations and Decision Sciences. McGraw-Hill.

[^7]: Kumar, R., 2005. Research Methodology: A Step-by-Step Guide for Beginners. SAGE Publications.

[^8]: See Salzberg, S. L., 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1 (3), 317–328. Also see Bouckaert, R. R., 2003. Choosing between two learning algorithms based on calibrated tests. In: Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, United States of America, pp. 51–58.

[^9]: Rice, J. A., 2007. Mathematical Statistics and Data Analysis. Duxbury Advanced Series. Brooks/Cole.

[^10]: Refaeilzadeh, P., Tang, L., Liu, H., 2009. Cross-validation. In: Encyclopedia of Database Systems. Springer, United States.
