“We all have forests on our minds. Forests unexplored, unending. Each one of us gets lost in the forest, every night, alone.”
― Ursula K. Le Guin, The Wind’s Twelve Quarters
by Qiuyue Wang
with Greg Page
Background: A Predictive Model for Hotel Cancellations
This article’s purpose is to introduce readers to the iterative nature of the grid search process for random forest hyperparameter tuning, and specifically to the need to adjust the boundaries of the search grid whenever a recommended hyperparameter value falls on one of those boundaries.
This article assumes that the reader already has a baseline level of familiarity with the setup and execution of the RandomForestClassifier module from scikit-learn, the general structure of classification tree models, and the concept of hyperparameter tuning.
The dataset used to build this model, hotel-bookings.csv, can be found here. The original version of the dataset included 119,390 observations and 32 variables, with each observation representing one hotel room booking. Our outcome variable, is_canceled, indicates whether a prospective guest canceled the booking. All told, 37.55% of the records in the dataset represent canceled bookings, while the other 62.45% do not.
During the data cleaning and Exploratory Data Analysis (EDA) phase of this project, we removed several variables due to factors such as redundancy, lack of association with the target variable, or an overwhelming proportion of NaN values.
After partitioning the data into a roughly 60/40 split between training and test records, we generated the model using the RandomForestClassifier module from scikit-learn.
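For readers who want to follow along, here is a minimal sketch of that setup. The cleaning step is a simplified placeholder (the article's actual EDA removed specific columns); only the 60/40 split and the untuned RandomForestClassifier mirror the workflow described above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the bookings data (path is illustrative).
hotel = pd.read_csv("hotel-bookings.csv")

# Simplified stand-in for the EDA step: keep the numeric predictors
# and crudely fill missing values (the real cleaning was more careful).
y = hotel["is_canceled"]
X = hotel.drop(columns=["is_canceled"]).select_dtypes("number").fillna(0)

# Roughly 60/40 split between training and test records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Untuned random forest, using scikit-learn's default settings.
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train, y_train)
```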
Scikit-Learn Default Settings and the Tree Overgrowth Problem
The default settings for an instance of the RandomForestClassifier module from scikit-learn do not place any limitations on tree growth. The default model hyperparameters are shown below:
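The snippet below prints those defaults; the values noted in the comment reflect recent scikit-learn releases and may differ slightly by version.

```python
from sklearn.ensemble import RandomForestClassifier

# Inspect the out-of-the-box hyperparameters.
print(RandomForestClassifier().get_params())
# Growth-related defaults include:
#   n_estimators=100, max_depth=None, max_leaf_nodes=None,
#   min_samples_split=2, min_samples_leaf=1, min_impurity_decrease=0.0
```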
As shown above, these default settings place no restriction on how deep each tree may grow (max_depth=None). There are also no constraints on the maximum number of leaf nodes (max_leaf_nodes=None) or the minimum number of samples in a leaf node (min_samples_leaf=1), and the minimum number of samples required to allow further splitting (min_samples_split) sits at two, the lowest value it can meaningfully take, since a node with a single record cannot be split further. When considering whether to make additional splits, the trees in this model are also not bound by any minimum threshold for the resulting impurity decrease (min_impurity_decrease=0.0).
Therefore, random forest models built in scikit-learn without user-specified hyperparameters are likely to show phenomenal performance against training set data — as they grow, they can continue to freely add any splits that improve node homogeneity. However, training set performance is not the primary benchmark for a predictive model, and we should always be mindful of the risk of overfitting to the training data. Of far greater interest to the modeler is the performance against yet-unseen data; here, that will be the 40 percent of observations that were randomly assigned to the test set during the partition.
Accuracy metrics for the untuned model against training and test sets are shown below:
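Assuming the rf_default model and the partition from the earlier sketch, those metrics can be computed like so:

```python
from sklearn.metrics import accuracy_score

# Compare accuracy on the data the model was trained on
# versus the held-out 40% test partition.
train_acc = accuracy_score(y_train, rf_default.predict(X_train))
test_acc = accuracy_score(y_test, rf_default.predict(X_test))
print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy:     {test_acc:.4f}")
```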
Although the test set accuracy is considerably lower than that of the training set, the 86.48% figure still looks quite impressive.
The question that we set out to answer from here was: “Can we improve this model’s performance against the test set through hyperparameter tuning?”
Hyperparameter Adjustments
To arrive at the best possible set of hyperparameters, we used a grid search process with five-fold cross-validation. This process is exhaustive: it requires the creation, fitting, and assessment of five separate models for every combination of hyperparameter values in the grid.
As shown below, this grid search iteration looked at three possible n_estimators values: 50, 100, and 150. The n_estimators hyperparameter determines the total number of trees to include in the model. The max_depth values of 2, 4, 6, and 8 correspond to limits placed on the number of splits that can occur between a tree’s root node and one of its terminal nodes. The max_features values of 2, 4, and 6 indicate the number of input variables (a.k.a. features) that the model should consider at each split. Like max_depth, min_samples_leaf is a way to limit a tree’s size: a split is not allowed if it would produce a terminal node containing fewer records than the minimum value specified by the modeler.
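A sketch of that first grid search is shown below. The n_estimators, max_depth, and max_features options match the values listed above; the min_samples_leaf options are placeholders, since the original four values appear only in the screenshot.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10, 20],  # placeholder values
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,               # five-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,          # use all available cores
)
grid_search.fit(X_train, y_train)
```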
While the grid search process can ultimately be rewarding, keep in mind that it is computationally expensive and time-consuming. All told, the grid search shown above generates 5 x 3 x 4 x 3 x 4 = 720 random forest models (each of which contains an average of 100 trees!). In an upcoming post, I will write about how scikit-learn’s RandomizedSearchCV module offers a handy shortcut, but in the meantime, just keep in mind that this sort of exhaustive grid search is not recommended for anyone in a rush.
As for the questions of “Why those hyperparameters?” and “Why those ranges of values?”: with the grid search process, a modeler simply has to start somewhere. As the results of successive tuning iterations become available, the modeler can then adjust the selected hyperparameters and/or the ranges of values considered.
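Once the grid search finishes, the recommended hyperparameters and the refit model's test set accuracy can be pulled from the fitted grid_search object in the sketch above; the figures discussed next come from the article's own output.

```python
# Hyperparameter combination with the best cross-validated accuracy.
print(grid_search.best_params_)

# best_estimator_ is automatically refit on the full training set;
# score() reports its accuracy against the test partition.
print(grid_search.best_estimator_.score(X_test, y_test))
```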
At first, this looked disappointing: the model’s performance against the test set was now considerably worse than before. However, the grid search output offered important clues about where to move the range of values. The recommended values for both max_depth and max_features fell on a boundary of the grid, suggesting that if higher values had been included as options, they might have led to even better model performance.
Tweaking the Settings to Handle Edge Cases
As a result of seeing those edge cases appear in the results of our first tuning attempt, we increased the values for max_depth and max_features, as shown below. We slightly adjusted the n_estimators options and we removed a high min_samples_leaf value from the grid, too:
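The exact Round II options appear in the screenshot; the sketch below shows the general shape of the adjustment, with the max_depth and max_features ranges pushed upward and the largest min_samples_leaf option dropped (the specific numbers here are assumptions).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Round II grid: boundaries widened where Round I landed on an edge.
param_grid_round2 = {
    "n_estimators": [100, 150, 200],
    "max_depth": [8, 10, 12, 14],    # previously capped at 8
    "max_features": [6, 8, 10],      # previously capped at 6
    "min_samples_leaf": [1, 5, 10],  # highest option removed
}

grid_search2 = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_round2, cv=5, scoring="accuracy", n_jobs=-1)
grid_search2.fit(X_train, y_train)
```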
With these new settings in place, we saw a big accuracy jump, compared to the previous iteration. However, we still have not topped the original model’s test set performance. We have gotten quite close now, though, and yet again, we have reason to be encouraged that more gains are possible. Once more, the max_depth and max_features values from this tuning round are edge cases, indicating that the range of grid search options for these hyperparameters is due for further expansion.
For the next iteration, we will push the edges out even further for max_depth and max_features, while also slightly reducing the range of min_samples_leaf options:
The results shown above suggest that these were good steps to take, as the recommended values here for max_depth and max_features fell yet again on the boundaries of the grid. Let’s see whether we picked up any accuracy gains:
Alright! We have finally topped the initial, untuned model’s test set performance — its accuracy was 86.48%, and we are now up to 87.28%. Next, let’s add a few more n_estimators while also allowing for additional tree growth and more features for consideration per split:
With the latest round of adjustments, we have squeezed out some additional improvement, with test set accuracy now at 87.66%. It looks like we may have found the best values to use for min_samples_leaf and max_features. Our incremental gains seem to be getting smaller, but we’re still improving, so let’s give it another go:
Now, the incremental gain from round to round has become incredibly tiny, but we’re more than a full percentage point above the original, untuned benchmark. We made some further incremental adjustments to the hyperparameters in this round, but did not see additional improvement against the test set.
Conclusion: Better Test Set Accuracy, with Scalability Potential
For this particular model, the first hyperparameter tuning round was a step backwards. However, each successive tuning iteration delivered accuracy gains that eventually brought us past the original benchmark. Looking back, we can see that increases to max_depth and max_features were most impactful, but we could not have anticipated that at the outset. A modeler using a grid search method must be willing to start somewhere, learn from the results, and then make adjustments.
Another important takeaway for readers is that hyperparameter settings do not exist in a vacuum — as some hyperparameters are adjusted, the recommended values for others can change as a result. In this example, note how min_samples_leaf continued to drop, and n_estimators increased, as the individual trees grew larger. If we had simply stuck with the Round I settings for these hyperparameters, without monitoring and adjusting them as we went, we would not have topped the untuned model’s test set accuracy.
One limitation of this analysis is that the observations used for this modeling process all came from the same source. Even though we performed a data partition, and did all the cross-validation and model fitting using the training data only, our test data was carved from the same original source as the training data. We would feel more confident in our model’s generalizability to new data if we were able to obtain data from a different hotel, and then run our model against that data.
Lastly, we should note that while the performance improvement seen through these hyperparameter tuning rounds may seem small, even tiny percentage changes can have a phenomenal impact on profits when we consider scale. Let’s imagine that we are doing this analysis for a multinational hotel chain that owns nearly one million hotel rooms worldwide, spread across approximately 6,000 properties.
With some “back-of-the-envelope” math, let’s assume that this chain takes 400,000 bookings per day (with an average stay length of two nights), that 35% of those bookings are canceled, and that an unpredicted cancellation costs the hotel an average of $50. That works out to roughly 140,000 cancellations per day, so each full percentage point improvement in cancellation prediction saves the company about $70,000 per day (roughly 1,400 additional correctly anticipated cancellations at $50 each), or more than $25 million annually. Even though grid searches can seem frustratingly slow compared to most other computing processes that we know, that annual payoff is enough to make a few hours of processing time well worth the cost!
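For readers who want to check that arithmetic, here it is in a few lines of Python, using only the assumptions stated above:

```python
bookings_per_day = 400_000
cancellation_rate = 0.35
cost_per_missed_cancellation = 50   # dollars
improvement = 0.01                  # one full percentage point

cancellations_per_day = bookings_per_day * cancellation_rate  # 140,000
daily_savings = (cancellations_per_day * improvement
                 * cost_per_missed_cancellation)
print(daily_savings)         # 70000.0 dollars per day
print(daily_savings * 365)   # 25550000.0 dollars per year
```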
The author is a Master’s Degree student in Applied Business Analytics at Boston University. She will graduate in Fall 2020. Her co-author is a Senior Lecturer in Applied Business Analytics at Boston University.