Evaluating A Machine Learning Classification Model

Chris - Sep 15 - Dev Community

In machine learning (ML), model evaluation is a crucial step to understand how well your model performs on unseen data. This is essential because, while a model might perform well on the training dataset, its ability to generalize to new data determines its true value. Evaluating classification models involves several methods and metrics, each designed to give insight into a different aspect of the model's performance.

Purpose of Model Evaluation

The main goal of model evaluation is to assess the quality of machine learning predictions and ensure that the model performs well on data it has not seen before (generalization). By evaluating ML models, we can:

  • Determine the effectiveness of the model in making accurate predictions.
  • Compare different models to choose the best one for a particular problem.
  • Fine-tune models to improve their performance by adjusting parameters or features.

Common Evaluation Procedures

There are several procedures for evaluating machine learning models:

  1. Training and testing on the same data

    • Rewards overly complex models that "overfit" the training data and won't necessarily generalize.
  2. Train/test split

    • Split the dataset into two pieces, so that the model can be trained and tested on different data
    • Better estimate of out-of-sample performance, but still a "high variance" estimate
    • Useful due to its speed, simplicity, and flexibility
  3. K-fold cross-validation

    • Systematically create "K" train/test splits and average the results together
    • Even better estimate of out-of-sample performance
    • Runs "K" times slower than train/test split

We can deduce from the above evaluation procedures that:

  • Training and testing on the same data is a classic cause of overfitting in which you build an overly complex model that won't generalize to new data and that is not actually useful.
  • A train/test split provides a much better estimate of out-of-sample performance.
  • K-fold cross-validation does better still by systematically creating "K" train/test splits and averaging the results together.
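As a quick illustration of the difference between the last two procedures, here is a minimal sketch using scikit-learn; the synthetic data and model choice here are only placeholders, not part of the tutorial's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Toy data standing in for any feature matrix X and label vector y
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# Single train/test split: fast and simple, but a "high variance" estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
split_score = model.fit(X_train, y_train).score(X_test, y_test)

# K-fold cross-validation (K=5): K train/test splits, averaged together
cv_scores = cross_val_score(model, X, y, cv=5)

print(split_score, cv_scores.mean())
```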

The Choice of a Model Evaluation Metric

The choice of a model evaluation metric depends on the specific machine learning problem you are solving. For classification problems, you can use classification accuracy, but it has its own limitations. In this guide, we will discuss the limitations of classification accuracy and also look at other important classification evaluation metrics.

Classification Accuracy and Its Limitations

Accuracy is one of the simplest and most commonly used evaluation metrics, represented by the percentage of correct predictions made by the model. However, accuracy has its limitations, especially when dealing with imbalanced datasets, where one class is significantly more frequent than others. In such cases, a model might achieve high accuracy simply by always predicting the majority class, without actually learning meaningful patterns.

We've chosen the Pima Indians Diabetes dataset for this tutorial, which includes the health data and diabetes status of 768 patients. Let's read the data and print the first 5 rows of the data.

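A minimal sketch of this step with pandas; the file name and column names here are assumptions for illustration, so adjust them to match your copy of the dataset:

```python
import pandas as pd

# Hypothetical column names and file path for the Pima Indians Diabetes data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv('pima-indians-diabetes.csv', header=None, names=col_names)

# Print the first 5 rows
print(pima.head())
```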

The label column is 1 if the patient has diabetes and 0 if the patient doesn't have diabetes. With this dataset, we intend to answer the question:

Question: Can we predict the diabetes status of a patient given their health measurements?

With this in mind, let's define our feature matrix X and response vector y. We then use train_test_split to split X and y into training and testing sets.

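A sketch of this step, assuming the hypothetical column names above; the particular subset of feature columns chosen here is only for illustration:

```python
from sklearn.model_selection import train_test_split

# Assumed subset of the health measurements used as features
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]   # feature matrix
y = pima['label']        # response vector: 1 = diabetes, 0 = no diabetes

# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```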

Next, we train a logistic regression model on the training set. During the fit step, the logreg model object learns the relationship between X_train and y_train. Finally, we make class predictions for the testing set.

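A sketch of the training and prediction step with scikit-learn's LogisticRegression:

```python
from sklearn.linear_model import LogisticRegression

# Instantiate the model and fit it: during this step the logreg object
# learns the relationship between X_train and y_train
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
```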

Now that we've made predictions for the testing set, we can calculate the classification accuracy, which is simply the percentage of correct predictions.

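Classification accuracy can be computed with scikit-learn's metrics module, roughly:

```python
from sklearn import metrics

# Classification accuracy: percentage of correct predictions on the testing set
print(metrics.accuracy_score(y_test, y_pred_class))
```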

However, anytime you use classification accuracy as your evaluation metric, it is important to compare it with the null accuracy, which is the accuracy that could be achieved by always predicting the most frequent class.

Null Accuracy

The null accuracy answers the question: if my model were to predict the most frequent class 100 percent of the time, how often would it be correct?

In the scenario above, 32% of the values in y_test are 1 (ones). In other words, a dumb model that always predicts that the patient does not have diabetes would be right 68% of the time (the zeros). This provides a baseline against which we can measure our logistic regression model.

When we compare the null accuracy of 68% with the model accuracy of 69%, our model doesn't look very good. This demonstrates one weakness of classification accuracy as a model evaluation metric: it doesn't tell us anything about the underlying distribution of the testing set.

Let's look at the calculation of the null accuracy.

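A sketch of the null accuracy calculation for a binary problem coded as 0/1:

```python
# Class distribution of the testing set
print(y_test.value_counts())

# Percentage of ones (has diabetes) and zeros (no diabetes) in y_test
print(y_test.mean())
print(1 - y_test.mean())

# Null accuracy: accuracy achieved by always predicting the most frequent class
print(max(y_test.mean(), 1 - y_test.mean()))
```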

In summary:

  • Classification accuracy is the easiest classification metric to understand.
  • But it does not tell you the underlying distribution of response values or predictions.
  • And it does not tell you what "types" of errors your classifier is making, which is why it is also good to evaluate your models using the confusion matrix.

Understanding the Confusion Matrix and Its Advantages

The confusion matrix is a table that describes the performance of a classification model. It provides a more detailed breakdown of a model’s performance, showing how often predictions fall into each category. It consists of four outcomes:

  • True Positives (TP): we correctly predicted that they do have diabetes; when both the actual and predicted values are 1
  • True Negatives (TN): we correctly predicted that they don't have diabetes; when both the actual and predicted values are 0
  • False Positives (FP): we incorrectly predicted that they do have diabetes; when the actual value is 0 but the predicted value is 1 (a "Type I error")
  • False Negatives (FN): we incorrectly predicted that they don't have diabetes; when the actual value is 1 but the predicted value is 0 (a "Type II error")

By using a confusion matrix, you can compute more nuanced metrics like precision, recall, and F1-score, which provide a clearer picture of a model’s performance in the presence of class imbalances.

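One way to compute the confusion matrix and pull out the four counts with scikit-learn (a sketch, continuing from the predictions above):

```python
from sklearn import metrics

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)

# Unpack the four outcomes for a binary (0/1) problem
TN, FP, FN, TP = confusion.ravel()
print(TP, TN, FP, FN)
```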

Metrics Computed from a Confusion Matrix

Recall (Sensitivity): The proportion of correctly predicted positive cases out of all actual positives. It measures the model's ability to identify positive cases: when the actual value is positive, how often is the prediction correct?

  • How "sensitive" is the classifier to detecting positive instances?
  • Also known as the "True Positive Rate"

Specificity: The proportion of correctly predicted negative cases out of all actual negatives. This means: when the actual value is negative, how often is the prediction correct?

  • How "specific" (or "selective") is the classifier in predicting positive instances?

False Positive Rate: The complement of specificity: when the actual value is negative, how often is the prediction incorrect?

Precision: The proportion of correctly predicted positive cases out of all predicted positives. It indicates the accuracy of positive predictions: when a positive value is predicted, how often is the prediction correct?

  • How "precise" is the classifier when predicting positive instances?

Many other metrics can also be computed from the confusion matrix: F1 score, Matthews correlation coefficient, and so on.

Conclusion

In conclusion, the choice of metric depends on your specific business objective; however, the confusion matrix gives you a more complete picture of how your classifier is performing. It also allows you to compute various classification metrics, and those metrics can guide your model selection.
