Judging the performance of a machine learning algorithm


Judging the performance of a machine learning (ML) algorithm is vital to ensure confidence in its future predictions. Machine learning is a method of analysis that allows us to create models from data, but unless we know how a model will perform on new data, training it is a pointless exercise.

Here we will discuss a few techniques to measure error, and different types of errors.

The three primary ways to measure the effectiveness of an ML model are mean square/absolute error, a measure of correctness, and (for classifiers) various arithmetic combinations of true/false positives/negatives.

Mean errors are a standard way to measure the accuracy of a regressor. Common choices are root mean square (RMS) error and mean absolute error (MAE).
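As a minimal sketch of these two measures, the snippet below computes MAE and RMS error by hand for two made-up sets of predictions. The function names and the data are illustrative assumptions, not part of any library. Both sets have the same MAE, but the set containing a single large error has twice the RMS error.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: the true value is always 0.
y_true = [0.0, 0.0, 0.0, 0.0]
even_errors = [5.0, 5.0, 5.0, 5.0]      # four errors of 5
one_big_error = [0.0, 0.0, 0.0, 20.0]   # a single error of 20

mae_even = mean_absolute_error(y_true, even_errors)
mae_big = mean_absolute_error(y_true, one_big_error)
rmse_even = root_mean_square_error(y_true, even_errors)
rmse_big = root_mean_square_error(y_true, one_big_error)

print("MAE:", mae_even, mae_big)
print("RMS:", rmse_even, rmse_big)
```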


Because it squares each error before averaging, RMS error magnifies the effect of large errors, which may be desirable for certain types of data (that is, when an error of 10 is “more than twice as bad” as an error of 5). Mean errors are negatively oriented scores, meaning smaller values are considered better. Conversely, scikit-learn models provide a ‘score’ method for which larger values indicate a better fit.

scikit-learn’s ‘score’ is different for different machine learning models. It may be the coefficient of determination (known as R² for linear regression), mean accuracy, or even a negated error such as -1*RMS (negated so that a larger value still means a better fit). The goal is for ‘score’ to be as high as possible.
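To make the "higher is better" convention concrete, here is a rough sketch of the coefficient of determination computed from its textbook definition. The function name and the data are assumptions for illustration; this is not scikit-learn's own code. A perfect fit scores 1, always predicting the mean scores 0, and a worse model scores below 0.

```python
def coefficient_of_determination(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and three sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = coefficient_of_determination(y_true, [1.0, 2.0, 3.0, 4.0])
r2_mean = coefficient_of_determination(y_true, [2.5, 2.5, 2.5, 2.5])
r2_reversed = coefficient_of_determination(y_true, [4.0, 3.0, 2.0, 1.0])

print(r2_perfect, r2_mean, r2_reversed)
```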

There are many ways to measure the performance of a classification algorithm. Different techniques are appropriate at different times; for “nice” datasets accuracy (the fraction of predictions that are correct) is an effective metric.


Accuracy breaks down if you have unbalanced classes. For example, a failure detection task that must keep the number of false positives low may benefit from maximizing specificity instead.

To see why, consider the case of searching for a rare occurrence. In the United States, the lifetime risk of a woman developing breast cancer is about 12%. If we were given a dataset of mammogram images, and merely returned that none of them indicate that a cancerous tumor will develop, then we would be 88% accurate, which is actually pretty good for a lot of machine learning algorithms, even though we would never detect a single case!
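The example above can be sketched with a made-up dataset of 100 cases, 12 of them positive, and a "classifier" that always answers negative. The variable names and the data are assumptions for illustration only.

```python
# Hypothetical dataset: 1 = cancer develops, 0 = it does not.
labels = [1] * 12 + [0] * 88

# A trivial model that predicts "no cancer" for every image.
predictions = [0] * len(labels)

correct = sum(1 for t, p in zip(labels, predictions) if t == p)
accuracy = correct / len(labels)

# Recall: the fraction of actual positives that the model found.
true_positives = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)
recall = true_positives / sum(labels)

print("accuracy:", accuracy)
print("recall:", recall)
```

The accuracy looks respectable, yet the recall shows that no positive case was found.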

Precision and recall form a more flexible set of metrics, and the two can be summarized by the F-score.
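A minimal sketch of these metrics, assuming we already have counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical, and the F-score shown is the common F1 variant (the harmonic mean of precision and recall).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Each ratio is reported as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```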

Alternatively, the metrics for a classifier can be summarized by a confusion matrix.

Certain values calculated from a confusion matrix may provide better insight into the performance of a machine learning algorithm.
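As a rough sketch, a binary confusion matrix can be tallied from paired true and predicted labels, and values such as specificity then follow from its four cells. The labels below are made up for illustration.

```python
from collections import Counter

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Count each (true label, predicted label) pair.
cells = Counter(zip(y_true, y_pred))
tp = cells[(1, 1)]  # true positives
fn = cells[(1, 0)]  # false negatives
fp = cells[(0, 1)]  # false positives
tn = cells[(0, 0)]  # true negatives

# Rows are the true class, columns the predicted class.
print("            pred 0  pred 1")
print(f"true 0      {tn:6d}  {fp:6d}")
print(f"true 1      {fn:6d}  {tp:6d}")

specificity = tn / (tn + fp)   # how many negatives were correctly rejected
sensitivity = tp / (tp + fn)   # another name for recall
accuracy = (tp + tn) / len(y_true)
print(specificity, sensitivity, accuracy)
```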

This is an effective way to succinctly communicate a lot of information; for example, precision, recall, specificity, and F-score can all be calculated with the information contained in a confusion matrix. While MSE, correctness, and confusion matrix values are effective in quantifying error, they do not inform strategies for improving the performance of an ML model.

To effectively reduce error, a data scientist must understand the source of error. Two major types of error are high bias and high variance errors. Here, we will define bias and variance, discuss the relationship between the two, and lastly relate them to overfitting and underfitting.

Bias is error that arises from erroneous assumptions built into the model, such as assuming a linear relationship when the true one is curved. This results in a model that fails to make accurate predictions on training data as well as testing data (underfitting).

Variance is error that arises from sensitivity to small fluctuations in the training data. A high-variance model may fit training data very well and thus have very low training error, but will not work well on test data (overfitting).
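The contrast can be sketched with two deliberately extreme models on noisy, made-up data. A constant model that always predicts the training mean stands in for high bias, and a one-nearest-neighbour model that memorizes the training set stands in for high variance. The data, the seed, and both models are assumptions chosen for illustration, not a recommended workflow.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def true_function(x):
    return 2.0 * x


# Noisy samples of a linear relationship; test points sit between training points.
train_x = [float(i) for i in range(20)]
train_y = [true_function(x) + rng.gauss(0, 1) for x in train_x]
test_x = [x + 0.25 for x in train_x]
test_y = [true_function(x) + rng.gauss(0, 1) for x in test_x]


def mse(model, xs, ys):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """High bias: ignores x entirely and always predicts the training mean."""
    return mean_y


def nearest_neighbour_model(x):
    """High variance: returns the noisy target of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("constant", constant_model),
                    ("1-nearest-neighbour", nearest_neighbour_model)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

The constant model has a large error on both sets (underfitting), while the nearest-neighbour model has zero training error but a nonzero test error, because it has memorized the noise in the training targets (overfitting).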