Using GridSearchCV to optimize your Machine Learning model

Hyper-parameter tuning can help you get the best performance out of your model


When training a machine learning model, the usual protocol is to first create a baseline, or “vanilla,” model without any specific tuning. We do this to benchmark how well the model performs on our data, so that when we tune the hyperparameters we can tell whether improvements are being made. But how can we improve our model?

One way to accelerate the process of improving our model is with a cross-validation tool called GridSearchCV. With GridSearchCV, we define a range of values for each selected hyperparameter. The search then fits and cross-validates the model on every combination of these values to find the one that scores best on our chosen metric. Check out the documentation for GridSearchCV here.
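To see the mechanics in isolation, here is a minimal, self-contained sketch on toy data (illustrative parameters and values of my own choosing, not the well dataset used below). With two values for each of two parameters and 5-fold cross-validation, GridSearchCV performs 2 x 2 x 5 = 20 cross-validation fits, plus one final refit of the best combination:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data purely to demonstrate the mechanics.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# 2 x 2 = 4 combinations; with cv=5 that means 20 cross-validation fits.
demo_grid = {'max_depth': [4, 8], 'min_samples_split': [2, 10]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid=demo_grid, scoring='accuracy', cv=5)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)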

As an example, I have provided the code for a random forest, ternary classification model below, and I will demonstrate how to use GridSearchCV effectively to improve the model’s performance.

A quick summary of random forest models: a random forest is essentially an ensemble of decision trees. The decision tree algorithm is called “greedy” because at each node it makes the locally best ‘decision’, splitting on the feature that yields the greatest information gain at that step.
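To make “greatest information gain” concrete, here is a small illustrative snippet (my own sketch, not part of the model code below) that scores two candidate splits of a toy label array and shows why a greedy tree prefers the purer one:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # How much a split reduces entropy, weighting children by size.
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

labels = np.array(['functional'] * 6 + ['non-functional'] * 4)
# The greedy step: between two candidate splits, keep the higher gain.
print(information_gain(labels, labels[:5], labels[5:]))  # mixed children, ~0.61
print(information_gain(labels, labels[:6], labels[6:]))  # pure children, ~0.97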

First, import the necessary libraries:

We will rely on the sklearn library to train/test split our data, run the grid search, and build a pipeline that chains standard scaling with the random forest model itself.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

Now we will split our data, using ‘status_group’ as our labels. These are statuses of water-well functionality. There are three status labels: functional, non-functional, and functional needs repair.

X = final_model_data.drop(labels=['id', 'status_group'], axis=1)
y = final_model_data.status_group
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Then I create a pipeline to streamline our process. The idea of the pipeline is to consolidate steps and save time. Here it chains StandardScaler (to get our numerical features on the same scale) with the random forest ensemble classifier itself.

def machine_learn(model):
    # Chain scaling and the given model, fit on the training set,
    # then report accuracy and per-class metrics on the test set.
    model_pipeline = Pipeline([('ss', StandardScaler()),
                               ('model', model)])
    fitted_model = model_pipeline.fit(X_train, y_train)
    print("Accuracy Score:", fitted_model.score(X_test, y_test))
    model_preds = fitted_model.predict(X_test)
    print(classification_report(y_test, model_preds))
    print(confusion_matrix(y_test, model_preds))

Then we quickly run our baseline model through our pipeline function, with a fixed random_state for reproducibility and n_estimators set to 200. ‘n_estimators’ is simply the number of decision trees in our random forest. For more information on how a decision tree operates, have a look at this post by Chirag Sehra.

machine_learn(RandomForestClassifier(random_state=123, n_estimators=200))

Our baseline model comes in with an accuracy of 0.802, which is actually pretty decent. (This is likely due to the extensive cleaning I had already done on the dataset, but that is a whole other blog post! We are only focusing on how to work GridSearchCV here.)

Now let’s see if we can improve our model’s accuracy. I will enter a wide range of values for GridSearchCV to exhaustively search over in pursuit of a better accuracy score.

GridSearch

# rf_pipeline is not shown in the original post; presumably it mirrors the baseline
# pipeline. Note the step name 'RF' must match the 'RF__' prefix in the grid keys.
rf_pipeline = Pipeline([('ss', StandardScaler()),
                        ('RF', RandomForestClassifier(random_state=123, n_estimators=200))])
newer_grid = [{'RF__max_depth': [8, 12, 16],
               'RF__min_samples_split': [12, 16, 20],
               'RF__criterion': ['gini', 'entropy']}]
gridsearch = GridSearchCV(estimator=rf_pipeline,
                          param_grid=newer_grid,
                          scoring='accuracy',
                          cv=5)
gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)
# accuracy: 0.7978050982843647
gridsearch.best_params_
# returns: {'RF__criterion': 'entropy', 'RF__max_depth': 16, 'RF__min_samples_split': 12}
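If you want to see how every combination fared rather than just the winner, the fitted search stores the full results in cv_results_. A quick sketch for inspecting it (assuming the gridsearch object fitted above, with pandas installed):

import pandas as pd

# Every combination the search tried, ranked by mean cross-validated accuracy.
results = pd.DataFrame(gridsearch.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').to_string(index=False))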

Our accuracy score was even lower than our baseline! If you are wondering how this is possible: the baseline placed no limit on tree depth and used the default min_samples_split of 2, while the grid forced specific values on those parameters, restricting the model. The lower accuracy tells us that the best parameter values likely lie outside the ranges I specified.

As the best max_depth param was the highest in its range, let’s try leaving it at its default, which places no limit on the depth of the trees. This can make the model prone to overfitting, but since we are judging by scores on the held-out test set, we would see it if it happened. min_samples_split, on the other hand, is the minimum number of samples required to split an internal node. Let’s keep working on that parameter while leaving ‘RF__criterion’ at ‘entropy’. The criterion measures the impurity of each node; the only real difference between ‘gini’ and ‘entropy’ is that ‘entropy’ uses a logarithm to calculate the impurity, which is more computationally expensive.
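To make the comparison concrete, here is a tiny illustrative snippet (made-up class proportions, not values from our data) computing both impurity measures for a single node:

import numpy as np

def gini(p):
    # Gini impurity: 1 - sum of squared class proportions.
    return 1 - np.sum(p ** 2)

def entropy_impurity(p):
    # Entropy: -sum of p_k * log2(p_k); the logarithm is the extra cost.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical proportions of our three statuses at one node.
p = np.array([0.5, 0.4, 0.1])
print(f"gini: {gini(p):.3f}, entropy: {entropy_impurity(p):.3f}")
# gini: 0.580, entropy: 1.361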

I am still leaving the range for min_samples_split a bit wide. If best_params_ spits out a value strictly inside this range, we can then narrow our values down around it.

newer_grid = [{'RF__min_samples_split': [8, 10, 12],
               'RF__criterion': ['entropy']}]
gridsearch = GridSearchCV(estimator=rf_pipeline,
                          param_grid=newer_grid,
                          scoring='accuracy',
                          cv=5)
gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)
# accuracy: 0.8083628533722303
gridsearch.best_params_
# {'RF__criterion': 'entropy', 'RF__min_samples_split': 8}

A bit of an improvement this time! Let’s try one more iteration. Because our best min_samples_split was the lowest value in my range, it is possible that an even better value lies below it. I will enter 6, 7, and 8 for this round, narrowing down even further than last round.

last_grid = [{'RF__min_samples_split': [6, 7, 8],
              'RF__criterion': ['entropy']}]
gridsearch = GridSearchCV(estimator=rf_pipeline,
                          param_grid=last_grid,
                          scoring='accuracy',
                          cv=5)
gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)
# accuracy: 0.8083628533722303
gridsearch.best_params_
# {'RF__criterion': 'entropy', 'RF__min_samples_split': 7}

Now we have a ‘best’ value strictly inside our very narrow range. So let’s plug the values from our GridSearch back into our model with 200 estimators and calculate some summary stats, using the same random_state as our baseline.

machine_learn(RandomForestClassifier(random_state=123, n_estimators=200,
                                     min_samples_split=7, criterion='entropy'))
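As a side note, and as an alternative to re-entering the parameters by hand: GridSearchCV defaults to refit=True, so the winning combination has already been refit on the full training set, and the tuned pipeline can be scored directly:

# The winning pipeline, already refit on all of X_train (refit=True is the default).
best_model = gridsearch.best_estimator_
print("Accuracy Score:", best_model.score(X_test, y_test))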

A slight improvement! GridSearch doesn’t always produce leaps and bounds, but any percentage of improvement is worth it. I hope you were able to learn at least something from this post.

If you have any questions at all about my methods or process, please comment! I am always looking to learn.

-Orin

I’m a recent Data Science graduate with a B.S. in Environmental Science. Currently seeking job opportunities. Constantly learning!
