The ROC Curve: An Overview
by Jiawen Huang
with Greg Page
Background: True Positive Rate and False Positive Rate
In classification modeling, a true positive is defined as a record whose actual outcome class is “positive” and which was predicted by the model to belong to the positive class. A model’s True Positive Rate (TPR), sometimes expressed as sensitivity or as recall, is found by taking the number of true positives identified by the model and dividing by the total number of records in the data whose actual outcome class was positive.
That felt like a whirlwind of terminology, so let’s use a confusion matrix to illustrate TPR with an example. In the image shown below, we can see that there are 98 (74+24) actual positive class outcomes in the data. Because 74 of those 98 records were successfully predicted by the model to belong to the positive class, this model’s TPR is 75.5%.
A false positive, on the other hand, is a record that the model predicts to belong to the positive class, but whose actual outcome class turns out to be negative. A model’s False Positive Rate (FPR) is found by taking the number of mistaken positive class predictions and dividing that value by the total number of records whose actual outcome class was negative.
In the confusion matrix depicted above, there are 169 records whose actual outcome class was negative (151+18). Since 18 of those records were incorrectly predicted by the model to belong to the positive class, this model’s FPR is 10.65%.
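To make the arithmetic concrete, here is a minimal Python sketch that computes both rates from the confusion matrix counts described above; the variable names are ours, chosen purely for illustration.

```python
# Counts from the confusion matrix described above
true_positives = 74     # actual positive, predicted positive
false_negatives = 24    # actual positive, predicted negative
true_negatives = 151    # actual negative, predicted negative
false_positives = 18    # actual negative, predicted positive

# TPR = TP / (TP + FN); FPR = FP / (FP + TN)
tpr = true_positives / (true_positives + false_negatives)
fpr = false_positives / (false_positives + true_negatives)

print(f"TPR: {tpr:.2%}")  # 75.51%
print(f"FPR: {fpr:.2%}")  # 10.65%
```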
Why do we care about TPR and FPR? For starters, these metrics can play a role in model diagnostics. If we notice that a model is consistently making mistakes in the same way — if it has a tendency to label too many records as positive, for instance — we might use that finding as a reason for adjusting the classification threshold. We can also use these metrics to help us with model comparisons. If we are choosing between different variants of a model, we might first weigh the costs of the various types of incorrect predictions, and then select the version that maximizes our expected gain.
Before we start the next section, we want to take a brief moment to go over the concept of a classification threshold. A classification threshold is the minimum predicted probability required for a model to assign a record to the positive class. In most modeling systems, and in most everyday situations, 0.50 is the default threshold. Adjustments to the threshold do not change the underlying model itself; they change only whether a particular record ends up assigned to the positive class.
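As a quick illustration of the idea, the sketch below applies two different thresholds to the same set of predicted probabilities; the probabilities themselves are made up for this example.

```python
import numpy as np

# Hypothetical predicted probabilities from some fitted classifier
predicted_probs = np.array([0.15, 0.42, 0.55, 0.68, 0.91])

# Default threshold of 0.50 versus a stricter threshold of 0.70
default_labels = (predicted_probs >= 0.50).astype(int)   # [0 0 1 1 1]
stricter_labels = (predicted_probs >= 0.70).astype(int)  # [0 0 0 0 1]

# The model's probabilities never change; only the cutoff used to
# convert them into positive-class assignments does.
print(default_labels, stricter_labels)
```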
Visualizing the TPR and the FPR on a Single Plot
A Receiver Operating Characteristic (ROC) plot is a helpful tool for enabling an analyst to visualize a classification model’s performance. Recently, we used ROC plots to help us compare the performance of a logistic regression model with that of a random forest model for predicting customer satisfaction at an all-you-can-eat buffet restaurant. In that classification problem, our response variable was a binary “yes” or “no” answer to a simple survey presented to guests as they left the restaurant: “Are you satisfied with your overall experience here today?”
The ROC plot for the logistic regression model that we used for the restaurant visitor satisfaction data is shown below. The code is also included here, to demonstrate the process for building such a graph in Python.
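A minimal sketch of that process is shown below, using scikit-learn and matplotlib. The synthetic dataset generated here simply stands in for the restaurant survey data, and the model settings are illustrative assumptions rather than the ones used in our analysis.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the restaurant satisfaction survey responses
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a logistic regression and get positive-class probabilities for the test set
logit_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
logit_probs = logit_model.predict_proba(X_test)[:, 1]

# roc_curve returns the FPR and TPR at every candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, logit_probs)

plt.plot(fpr, tpr, color="blue", label="Logistic regression")
plt.plot([0, 1], [0, 1], color="black", linestyle="--", label="Null model")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve: Logistic Regression")
plt.legend()
plt.show()
```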
On an ROC plot, the FPR is shown on the x-axis, while the TPR is plotted on the y-axis. Along both of these axes, the values range from 0 to 1.
To build a conceptual sense of what this graph is showing, let’s start with the extreme positions in the bottom left corner (0,0) and the upper right corner (1,1).
What if we built a model that never predicted that any record would belong to the positive outcome class? (This equates to a classification threshold of 1, meaning the model would need to be 100% certain that a record belonged to the positive class before classifying it as such, which is an impossibility for a range-bound logistic regression model). What would this model have as a TPR and an FPR? If you said “0” for each of these, you are correct. This fictional (and admittedly ridiculous!) model would never identify any actual member of the positive class, nor would it ever predict that an actual member of the negative class would belong to the positive one. The point in the bottom left corner, where both the dotted black line and solid blue line originate, identifies the FPR and TPR of such a model.
Now, let’s take things to the other extreme. What if we built a model that just always assigned records to the positive class? (This would mean that it used a classification threshold of 0). Because this model would never fail to identify an actual positive class member, its TPR would be 1.0. That sounds impressive, but only for a split-second — once we stop to figure out that such a model would also have an FPR of 1.0, it begins to sound quite useless.
With those extremes having been established, let’s take a look now at the solid blue line in the figure above. Every point along this line represents our logistic regression model’s performance at some particular classification threshold — for practicality purposes, these thresholds aren’t actually labeled anywhere on the graph, but the solid line still informs us about the model’s overall capability. Reading this graph tells us, for example, that there is some threshold at which we can achieve a TPR of 0.8 and an FPR of 0.4 with this model. As we follow the path of the curve from left to right, we notice the tradeoff that it depicts: we can obtain a better TPR if we are willing to suffer a higher FPR, or we can improve our FPR if we’re willing to sacrifice a bit by moving down the curve to a lower TPR.
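Although the thresholds themselves are not labeled on the plot, the arrays returned by roc_curve make them easy to look up. For instance, continuing from the fpr, tpr, and thresholds arrays produced in the earlier sketch, we could find the first threshold at which the TPR reaches 0.8:

```python
import numpy as np

# fpr, tpr, and thresholds come from the roc_curve call in the earlier sketch
idx = np.argmax(tpr >= 0.8)  # index of the first point where TPR reaches 0.8
print(f"threshold: {thresholds[idx]:.3f}  TPR: {tpr[idx]:.3f}  FPR: {fpr[idx]:.3f}")
```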
The dotted diagonal line shows the theoretical performance of a “null model” that uses none of the input variables, and therefore cannot distinguish positive records from negative ones. At any threshold, such a model’s TPR would equal its FPR: as the threshold is adjusted downward, the proportion of actual positives and the proportion of actual negatives assigned to the positive class would rise in lockstep, tracing the diagonal.
A model that is excellent at distinguishing positive class members from negative ones would have a solid line that arcs high above the dotted one. If the solid line were beneath the diagonal, by contrast, that would indicate that the model was performing worse than random guessing.
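As a quick sanity check on this idea, scores drawn completely at random (ignoring the inputs) produce an ROC curve that hugs the diagonal. The sketch below reuses y_test, roc_curve, and the plotting imports from the earlier example.

```python
import numpy as np

rng = np.random.default_rng(42)
random_scores = rng.random(len(y_test))  # scores that ignore the inputs entirely

null_fpr, null_tpr, _ = roc_curve(y_test, random_scores)
plt.plot(null_fpr, null_tpr, color="blue", label="Random scores")
plt.plot([0, 1], [0, 1], color="black", linestyle="--", label="Diagonal")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```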
Our second ROC plot, shown below, depicts the performance of the random forest model from the restaurant visitor satisfaction scenario.
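For reference, here is a sketch of how the second curve could be produced, again reusing the synthetic stand-in data from the logistic regression example; the random forest settings shown are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train, X_test, y_train, y_test, roc_curve, and plt from the earlier sketch
rf_model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
rf_probs = rf_model.predict_proba(X_test)[:, 1]

rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
plt.plot(rf_fpr, rf_tpr, color="blue", label="Random forest")
plt.plot([0, 1], [0, 1], color="black", linestyle="--", label="Null model")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve: Random Forest")
plt.legend()
plt.show()
```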
Note the different shape that the solid blue line takes in this graph, compared with the one in the logistic regression model’s graph, shown earlier. Here, the solid blue line rises almost vertically at first, hugging the y-axis, until the TPR starts to extend beyond 0.7. By contrast, the first ROC curve showed that the logistic regression model could only obtain a TPR of 0.7 at a far higher cost — an FPR of 0.3. Furthermore, a curve-to-curve comparison reveals that the random forest can obtain a TPR of 0.9 with an FPR of just 0.3, whereas the logistic regression model could only obtain such a high TPR with an FPR of nearly 0.6.
Area under the Curve as a Classification Model Metric
When comparing different models’ ROC curves, the Area under the Curve (AUC) metric gives the modeler a more objective, quantitative basis for assessment than could be obtained with a simple “eyeball test” like the one used in the previous section. AUC tells us the proportion of the total plot area that falls beneath the solid line representing a model’s performance.
The highest possible value for the AUC metric would be 1. If a model could somehow predict all the true positives from within a dataset, without mistakenly predicting any false positives, its solid line would start at the bottom left and move directly up the y-axis to the top left corner, before moving horizontally across the top of the graph to the top right corner. In coordinate geometry terms, it would go from (0,0) to (0,1) and then to (1,1). All of the area in the TPR-FPR plot would fall beneath such a line.
The dotted diagonal line, by contrast, has an AUC value of 0.5, since it bisects the plot perfectly, with half of the plot’s area below the diagonal, and half above.
For our model-vs.-model analysis, AUC quantifies the differences we observed in a precise way. The AUC results for the logistic regression and random forest models are shown below.
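In scikit-learn, these values come directly from roc_auc_score. The sketch below reuses the fitted models and stand-in data from the earlier snippets, so its output will not match the restaurant figures exactly.

```python
from sklearn.metrics import roc_auc_score

# logit_probs, rf_probs, and y_test come from the earlier sketches
print(f"Logistic regression AUC: {roc_auc_score(y_test, logit_probs):.4f}")
print(f"Random forest AUC:       {roc_auc_score(y_test, rf_probs):.4f}")
```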
The 0.7282 AUC value for the logistic regression model is still considerably better than the 0.5 null model benchmark, but it pales in comparison to the far more impressive 0.9197 value posted by the random forest model.
Of course, every dataset is unique, and so is every model. Perhaps even more importantly, the business purpose of every model is unique, which means that the costs of either type of misclassification can be highly idiosyncratic, even when similar algorithms are applied to similar datasets. For all of these reasons, we cannot suggest that there is some universal benchmark for a “good” AUC value. Regardless, AUC is an effective and concise way of capturing a model’s ability to accurately discern among outcome classes, as it conveys this ability with a single numeric value. For this reason, AUC familiarity should be part of the toolkit of any modeler using ROC curves to visually demonstrate model performance.
The author is a Master’s Degree student in Applied Business Analytics. He will graduate in Fall 2020. His co-author is a Senior Lecturer in Applied Business Analytics at Boston University.