Notes on Model Evaluation and Selection

A model evaluation measure is a quantitative statement of how well a particular classifier meets the goals of the supervised learning endeavour[^1]. In this research project the performance of each hypothesised classifier is reported by means of several evaluation measures: the confusion matrix, classification accuracy, precision and recall. Each reported evaluation measure is estimated using 10-fold cross-validation[^2]. That is, an evaluation measure is calculated as the average of 10 individual evaluation measures estimated on independently trained and tested classifiers. The evaluation measures used in this dissertation are outlined in the next three subsections.
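As an illustration of this estimation procedure, the snippet below is a minimal sketch of how 10-fold cross-validated accuracy, precision and recall might be obtained with scikit-learn; the dataset and classifier are placeholders, not the ones used in this project.

```python
# A minimal sketch of estimating evaluation measures with 10-fold
# cross-validation. The dataset and classifier below are illustrative
# placeholders, not the ones used in this project.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)              # stand-in data set
clf = DecisionTreeClassifier(random_state=0)   # stand-in hypothesised classifier

# Each measure is estimated on 10 independently trained and tested folds
# and then averaged.
scores = cross_validate(
    clf, X, y, cv=10,
    scoring=["accuracy", "precision_macro", "recall_macro"],
)
for name in ["accuracy", "precision_macro", "recall_macro"]:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (averaged over {len(fold_scores)} folds)")
```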

Confusion matrix

A confusion matrix is a matrix of size $n \times n$, where $n$ is the number of classes considered in the classification problem, that shows the relation between the predicted class labels and the actual class labels[^3]. The confusion matrix is particularly useful for determining the extent of misclassification in individual classes. A classifier may perform extremely well in correctly classifying instances belonging to one class, yet extremely poorly in correctly classifying instances belonging to another class. The value of a classifier that identifies some classes well and other classes poorly depends on the application domain.

A confusion matrix for a three-class classification problem is shown in the table below. The number of instances actually belonging to class 1 that were also predicted as belonging to class 1 is denoted $a$; the total number of instances in the testing set actually belonging to class 1 is denoted $t_{r1} = a + b + c$; while the total number of instances in the testing set predicted to belong to class 1 is denoted $t_{c1} = a + d + g$.

| Actual \ Predicted | Class 1 | Class 2 | Class 3 | Total | Recall |
| --- | --- | --- | --- | --- | --- |
| Class 1 | $a$ | $b$ | $c$ | $t_{r1}$ | $\rho_1$ |
| Class 2 | $d$ | $e$ | $f$ | $t_{r2}$ | $\rho_2$ |
| Class 3 | $g$ | $h$ | $i$ | $t_{r3}$ | $\rho_3$ |
| Total | $t_{c1}$ | $t_{c2}$ | $t_{c3}$ | | |
| Precision | $\pi_1$ | $\pi_2$ | $\pi_3$ | | |

Model evaluation criteria: example of a confusion matrix
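For illustration, the sketch below builds such a confusion matrix from actual and predicted labels; the labels are made up, and `build_confusion_matrix` is only a hypothetical stand-in for library routines such as scikit-learn's `confusion_matrix`.

```python
import numpy as np

def build_confusion_matrix(y_true, y_pred, n_classes):
    """Confusion matrix with rows as actual classes and columns as
    predicted classes, matching the layout of the table above."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Made-up labels for a three-class problem (classes coded 0, 1, 2).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 2]

cm = build_confusion_matrix(y_true, y_pred, n_classes=3)
row_totals = cm.sum(axis=1)   # t_r: instances actually in each class
col_totals = cm.sum(axis=0)   # t_c: instances predicted for each class
print(cm)
print("row totals:", row_totals, "column totals:", col_totals)
```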

Precision and recall

There are four possible classification outcomes for every instance in the data set[^4]. If a classifier assigns an instance to class $i$ and the instance does in actual fact belong to class $i$, it is counted as a true positive (${tp}_i$). If the classifier assigns the instance to class $i$ when it does not actually belong to class $i$, it is counted as a false positive (${fp}_i$). An instance not assigned to class $i$ that does not belong to class $i$ is counted as a true negative (${tn}_i$), whereas an instance that is not assigned to class $i$ but does belong to class $i$ is counted as a false negative (${fn}_i$).

For every class $i$ in the $n$-class classification problem a precision rate can be calculated as[^5],

$$\pi_i = \frac{{tp}_i}{{tp}_i + {fp}_i}$$

and a recall rate can be calculated as,

$$\rho_i = \frac{{tp}_i}{{tp}_i + {fn}_i}$$

The precision rate measures the probability that an instance classified as belonging to class $i$ is in fact a class $i$ instance. The precision rate for class 1 in the table can be calculated as $\pi_1 = a/t_{c1}$. The recall rate summarizes a classifier's ability to identify an instance as belonging to class $i$ if it does in fact belong to class $i$. The recall rate for class 1 in the table can be calculated as $\rho_1 = a/t_{r1}$. Similarly, precision and recall can be calculated for each class. The average precision and recall rates for a classifier are calculated as the averages over all the individual class precision and recall rates.
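The per-class rates can be read straight off the confusion matrix counts. The sketch below assumes the row-actual/column-predicted layout of the table above and uses an illustrative matrix of counts.

```python
import numpy as np

def precision_recall_per_class(cm):
    """Per-class precision and recall from a confusion matrix whose rows
    are actual classes and whose columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # pi_i  = tp_i / (tp_i + fp_i)  (column totals)
    recall    = tp / cm.sum(axis=1)   # rho_i = tp_i / (tp_i + fn_i)  (row totals)
    return precision, recall

# Illustrative three-class confusion matrix (rows actual, columns predicted).
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  5, 49]])

precision, recall = precision_recall_per_class(cm)
print("per-class precision:", np.round(precision, 3))
print("per-class recall:   ", np.round(recall, 3))
print("macro averages:", precision.mean().round(3), recall.mean().round(3))
```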

Model accuracy

Accuracy, sometimes called the correct classification rate, can be calculated as the number of correct classifications made by the model over the test set, divided by the total number of instances in the test set[^6]. From the confusion matrix in the table above, the classification accuracy of the classifier can be calculated as $(a + e + i)/(a + b + \dots + i)$.
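In code this is simply the trace of the confusion matrix divided by the sum of all its entries, as in the brief sketch below (again with an illustrative matrix of counts).

```python
import numpy as np

# Illustrative confusion matrix (rows actual, columns predicted).
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  5, 49]])

# (a + e + i) / (a + b + ... + i): diagonal sum over the total count.
accuracy = np.trace(cm) / cm.sum()
print(f"classification accuracy: {accuracy:.3f}")
```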

The classification accuracy is related to the empirical loss (with a 0-1 loss function) estimated in cross-validation[^2]. The empirical loss measures the rate of misclassification whereas the classification accuracy $\overline{A}$ measures the rate of correct classification, formulated as,

$$
\begin{aligned}
\overline{A} &= 1 - \overline{R}_{emp}(h) \\
&= 1 - \underset{S_1,S_2,\dots,S_{10}}{\operatorname{mean}} R_{emp}(h) \\
&= 1 - \underset{S_1,S_2,\dots,S_{10}}{\operatorname{mean}} \frac{1}{|S_i|} \sum_{(\vec{x},y) \in S_i} \begin{cases} 0 &\text{if } h(\vec{x}) = y \\ 1 &\text{if } h(\vec{x}) \ne y \end{cases}
\end{aligned}
$$
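A minimal sketch of this calculation, assuming scikit-learn's `KFold` for the subsets and a placeholder dataset and classifier, might look as follows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # placeholder data set
folds = KFold(n_splits=10, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in folds.split(X):
    h = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # Empirical 0-1 loss R_emp(h_i) on the held-out subset S_i.
    zero_one_loss = np.mean(h.predict(X[test_idx]) != y[test_idx])
    fold_accuracies.append(1.0 - zero_one_loss)   # A_i = 1 - R_emp(h_i)

A_bar = np.mean(fold_accuracies)                  # averaged over S_1, ..., S_10
print(f"cross-validated accuracy: {A_bar:.3f}")
```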

Model comparison

After evaluating each hypothesised classifier built as part of the knowledge discovery endeavour, the performance measures can be compared to select the best model. In comparing the performance of two classifiers, the best classifier can be selected based on the perceived difference in the observed evaluation measures. Alternatively, a more rigorous approach is to statistically test the significance of the observed performance difference between the two classifiers[^7]. A statistical hypothesis test tries to determine whether one can expect the same results from a repeated study or whether the observed results can be attributed to chance.

To compare the performance of two classifiers $h^a$ and $h^b$, a 10-fold cross-validation $t$-test is used to test the research hypothesis that the two classifiers perform equally well[^8]. That is, the null hypothesis states that the two classifiers perform equally well and that any observed difference is due to chance, whereas the alternative hypothesis states that their performances differ. If the two algorithms are statistically different then the null hypothesis is rejected in favour of the alternative hypothesis.

As stated already, the classification accuracy of an induced classifier is estimated using $k$-fold cross-validation. For each subset $S_i$ the accuracy measures $A^a_i = 1 - R_{emp}(h^a_i)$ and $A^b_i = 1 - R_{emp}(h^b_i)$ are estimated. The difference in the estimated accuracies of classifiers $h^a$ and $h^b$ on subset $S_i$ is $d_i = A^a_i - A^b_i$. The difference in the estimated classifier accuracies is assumed to be normally distributed with mean,

$$\overline{d} = \frac{1}{k}\sum_{i=1}^k d_i$$

and variance

$$\hat{\sigma}_d^2 = \frac{\sum_{i=1}^k (d_i - \overline{d})^2}{k-1}$$

Under the null hypothesis $H_0: d = 0$ the test statistic,

$$t_d = \sqrt{k}\,\frac{\overline{d}}{\sqrt{\hat{\sigma}_d^2}}$$

has a Student's $t$-distribution with $k-1$ degrees of freedom. From the Student's $t$-distribution the probability $p$ of obtaining a test statistic greater than or equal to the observed statistic, assuming the null hypothesis is true, can be obtained. If the two-tailed $p$-value is less than the chosen level of significance $\alpha$ then the null hypothesis $H_0$ can be rejected in favour of the two-tailed alternative hypothesis $H_A: d \ne 0$. That is, the null hypothesis is rejected if $p(|t| \ge t_d \mid H_0) \le \alpha$. The alternative hypothesis states that the accuracy of classifier $h^a$ and the accuracy of classifier $h^b$ are statistically different from one another. Instead of explicitly specifying a level of significance, usually chosen as $\alpha = 0.05$, the $p$-value can also be interpreted as the smallest significance level at which the null hypothesis would be rejected[^9].
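As a sketch of the full procedure, the snippet below computes $\overline{d}$, $\hat{\sigma}_d$, the test statistic and the two-tailed $p$-value for two sets of hypothetical per-fold accuracies; `scipy.stats.ttest_rel` performs the same paired test and is shown for comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two classifiers h^a and h^b,
# estimated on the same 10 cross-validation subsets S_1, ..., S_10.
A_a = np.array([0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.91, 0.90, 0.92, 0.89])
A_b = np.array([0.88, 0.87, 0.90, 0.89, 0.90, 0.86, 0.88, 0.89, 0.90, 0.87])

d = A_a - A_b                          # per-fold differences d_i
k = len(d)
d_bar = d.mean()                       # mean difference
sigma_hat = d.std(ddof=1)              # sample standard deviation of the d_i
t_d = np.sqrt(k) * d_bar / sigma_hat   # test statistic, k - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_d), df=k - 1)   # two-tailed p-value

print(f"t = {t_d:.3f}, p = {p_value:.4f}")

# scipy's paired t-test gives the same statistic and p-value.
t_scipy, p_scipy = stats.ttest_rel(A_a, A_b)
```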

As alternatives to the 10-fold cross-validation $t$-test, McNemar's test, the $5\times2$-fold cross-validation test and the $10\times10$-fold cross-validation test were also considered. McNemar's test is not suitable as it does not consider variability due to the choice of training set or due to the internal randomness of the induction algorithm [^book_kuncheva]. Neither the $10\times10$-fold cross-validation test nor the $5\times2$-fold cross-validation test has been widely adopted in the data mining field, with the 10-fold cross-validation test remaining the most popular procedure[^10].


Footnotes

[^1]: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17 (3), 37.

[^2]: See the post on cross-validation.

[^3]: Kohavi, R., Provost, F., 1998. Glossary of terms. Machine Learning 30, 271–274.

[^4]: Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874.

[^5]: Ozgur, A., Ozgur, L., Gungor, T., 2005. Text categorization with class-based and corpus-based keyword selection. In: International Symposium on Computer and Information Sciences. Istanbul, Turkey, pp. 606–615.

[^6]: Olson, D. L., Shi, Y., 2006. Introduction to Business Data Mining. Irwin-McGraw-Hill Series: Operations and Decision Sciences. McGraw-Hill.

[^7]: Kumar, R., 2005. Research Methodology: A Step-by-Step Guide for Beginners. SAGE Publications.

[^8]: See Salzberg, S. L., 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1 (3), 317–328. Also see Bouckaert, R. R., 2003. Choosing between two learning algorithms based on calibrated tests. In: Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, United States of America, pp. 51–58.

[^9]: Rice, J. A., 2007. Mathematical Statistics and Data Analysis. Duxbury Advanced Series. Brooks/Cole.

[^10]: Refaeilzadeh, P., Tang, L., Liu, H., 2009. Cross-validation. In: Encyclopedia of Database Systems. Springer, United States.
