Understanding Grid Search/Randomized CV’s (refit=True)

Orvin Demsy
7 min read · Oct 3, 2020


In this article I will share my experience and knowledge of implementing hyperparameter tuning with scikit-learn's GridSearchCV or RandomizedSearchCV. Concretely, I will focus on the refit=True parameter, as that is the one that confused me the most when I started using GridSearchCV/RandomizedSearchCV. I hope this article helps you build a better understanding of that parameter.

Pre-requisite

First off, as a prerequisite, a basic understanding of the following topics is expected:

  • K-Fold CV
  • Training a model
  • Hyperparameters

I will explain them briefly here, but I encourage you to read up on them yourself.

Note

The focus of this article is stated above; consequently, the following will be ignored during data processing:

  • Un-normalized feature values
  • The convergence warnings caused by those un-normalized values

If you want to follow along, I suggest suppressing all warnings by executing:
import warnings 
warnings.filterwarnings('ignore')

Problem formulation

The objective is as follows:

find the hyperparameter values of a logistic regression model that give the best accuracy for binary classification.

Load dataset

Let’s use the readily available breast cancer dataset in scikit-learn package.

from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
dataset = load_breast_cancer()
dataset = pd.DataFrame(np.c_[dataset.data, dataset.target], columns=list(dataset.feature_names) + ['target'])
dataset.head()

Peek at the dataset to see how many features there are. Let's also see how many samples/instances are available in each class:

Breast Cancer Dataset
dataset.groupby('target').size()

Output:

target
0.0    212
1.0    357
dtype: int64

So far, we know the dataset contains 569 samples with 30 features; 212 of those are classified as ‘0’ and 357 are classified as ‘1’. Let's continue and split the data into train and test sets.

Split into train and test set

X_all = dataset.iloc[:, :-1] 
y_all = dataset.iloc[:, -1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.6, random_state=42)
print('X_train shape:', X_train.shape)
print('y_train shape:',y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:',y_test.shape)

Output:

X_train shape: (341, 30) 
y_train shape: (341,)
X_test shape: (228, 30)
y_test shape: (228,)

Okay, now the data preprocessing steps are done and we are ready for the classification stage. First, we need to build our model and fit it to the training data.

Possible FAQ so far:
Why set random_state to 42?
> You can set this number to any integer you want; the purpose of setting random_state is reproducibility. The trivial reason why 42 is often used can be found here

Why do you set train_size to 0.6?
> Usually you want to split your data with a ratio of 7:3 or 8:2 (train:test), but the objective of this work is different; this split is chosen for explanatory purposes

Training the model

Import LogisticRegression, define a model, then train it on our training data:

from sklearn.linear_model import LogisticRegression

model_no_tune = LogisticRegression(max_iter=10)
model_no_tune.fit(X_train, y_train)

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, 
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=10, multi_class='auto',
n_jobs=None, penalty='l2',random_state=None,
solver='lbfgs', tol=0.0001,
verbose=0,warm_start=False)

The lengthy output inside the parentheses following LogisticRegression shows the model's current parameters, mostly at their default values. Some of them are hyperparameters whose values we can set as we wish; as an example, I set max_iter to 10.

For this article let's focus on two hyperparameters, namely C and max_iter. We'll find the best values for these in a later section.
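If you are not sure which hyperparameters a model exposes, a quick way to list them all is get_params(). Here is a minimal sketch using the model_no_tune instance defined above; the comment shows roughly what to expect.

model_no_tune.get_params()
# e.g. {'C': 1.0, ..., 'max_iter': 10, 'penalty': 'l2', 'solver': 'lbfgs', ...}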

Then let’s see how it performs on test data

model_no_tune.score(X_test, y_test)

Output:

0.9517543859649122

Good, 0.95 is not bad at all. But is this really the best the model can give? Nope; remember, we can tune the hyperparameters. But first, let's do a quick review of what happens when a model is trained.

What happens when we train the model?

This is where I expect you to know what happens when we train a model; knowledge of logistic regression might be helpful too.

Remember that when we train the model, we feed it our training data (X_train) and the target (y_train). The model learns the weights/parameters (usually denoted by θ₁, …, θₙ) that best fit those data, where n is determined by the number of features. So, if there are 30 features in our data, there will be 30 weights in our model. Let's confirm there are 30 weights in our model.

model_no_tune.coef_

Output:

array([[ 2.89479065e-03, 5.25384007e-03, 1.76541312e-02, 2.17186403e-02, 3.03361474e-05, 9.22071112e-06, -1.94952117e-05, -9.80070124e-06, 5.62485250e-05, 2.28559189e-05, 1.53327500e-05, 4.05040997e-04, 9.31758108e-05, -6.20673739e-03, 2.54362610e-06, 3.87518114e-06, 3.89117211e-06, 1.87377855e-06, 6.56052693e-06, 1.17953472e-06, 2.78523885e-03, 6.59304644e-03, 1.69345600e-02, -2.10331333e-02, 3.90652953e-05, 6.69649802e-06, -2.94063588e-05, -5.77885791e-06, 7.76807859e-05, 2.53727980e-05]])

Are there 30 of them?

model_no_tune.coef_.shape

Output

(1, 30)

Yes, we can see there are 30 weights/parameters in the model. There is also a bias value that we can see by calling

model_no_tune.intercept_

Output:

array([0.00036287])
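To see what these weights and the bias actually do, here is a minimal sketch (my own check, not part of the original tutorial) showing that scikit-learn's decision_function is just the linear combination X·θ plus the bias, and that passing it through a sigmoid reproduces predict_proba for the positive class. It assumes the model_no_tune and X_test defined above.

# linear combination of features, weights and bias
manual = X_test.values @ model_no_tune.coef_.T + model_no_tune.intercept_
print(np.allclose(manual.ravel(), model_no_tune.decision_function(X_test)))  # True
# sigmoid of that linear score gives the probability of class 1
print(np.allclose(1 / (1 + np.exp(-manual.ravel())),
                  model_no_tune.predict_proba(X_test)[:, 1]))                # True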

Pay attention to these weights, because a different set of training data or different hyperparameter values will give different weight values.

K-Fold Cross Validation Review

In k-fold CV we split the data into k folds; we then train the model on k-1 folds (equivalent to the training data) and evaluate the score on the remaining fold (equivalent to the test data). Notice that this means each split produces a different set of model weights/parameters. Here's an illustration of 3-fold cross-validation.

K-Fold CV

The model weights of set 1 are not equal to those of set 2 or 3, and vice versa. The takeaway is that the model weights/parameters depend on which training data and hyperparameter values the model is trained on.
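To make this concrete, here is a minimal sketch (my own illustration, not the grid search itself) that clones the model, fits it on each of 3 folds, and prints the first weight of each fold. The exact numbers will differ from fold to fold, which is the whole point.

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=3)
for i, (train_idx, _) in enumerate(kf.split(X_train), start=1):
    fold_model = clone(model_no_tune)
    # each fold is trained on a different subset, so the learned weights differ
    fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    print('fold %d, first weight: %.6f' % (i, fold_model.coef_[0, 0]))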

Training with GridSearchCV

Now it's time to set up our hyperparameter values. By convention, you create a dictionary whose keys correspond to the model's hyperparameters; recall that we want to adjust the C and max_iter values of model_no_tune.

from sklearn.model_selection import GridSearchCV

parameters = {'C': [0.1, 1, 10],
              'max_iter': [10, 100, 1000]}

GridSearchCV will set up pairs of the parameters defined in the dictionary and use them as model parameters; in this example there will be 9 pairs:

9 pairs of hyperparameter combinations

For further information about how GridSearchCV works please refer to the documentation.

In total there will be 9 models, each with a different combination of the C and max_iter values defined above, and each will be trained on the data fed into it.
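If you want to see those 9 combinations explicitly, scikit-learn's ParameterGrid enumerates the same pairs that GridSearchCV will try. A minimal sketch using the parameters dict defined above:

from sklearn.model_selection import ParameterGrid

for combo in ParameterGrid(parameters):
    print(combo)
# {'C': 0.1, 'max_iter': 10}
# {'C': 0.1, 'max_iter': 100}
# ... 9 combinations in total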

To train with GridSearchCV we need to create a GridSearchCV instance and define the number of cross-validation folds we want; here we set cv=3.

grid = GridSearchCV(estimator=model_no_tune, param_grid=parameters, cv=3, refit=True)
grid.fit(X_train, y_train)

Let’s take a look at the results

You can check for yourself that grid.cv_results_ also includes information about the time required to process the data; we will ignore the time-related information and just look at the scores. By calling pd.DataFrame(grid.cv_results_) we get:

Inside of cv_results minus time-related info

Notice that there are 9 rows; each row represents a model with different hyperparameter values. You can also infer which model performs best by looking at mean_test_score, which should correspond to rank_test_score.
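Here is a short sketch that keeps only the columns discussed above; this is just column selection on grid.cv_results_, so the exact layout may differ slightly from the screenshot.

# keep the hyperparameter combination, its mean CV score and its rank
results = pd.DataFrame(grid.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))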

Alternatively, we can call grid.best_score_ to see the best score; this gives the best mean_test_score (i.e., 1st place in rank_test_score).

grid.best_score_

Output:

0.952957615277131

Plus, you can see the best parameters that correspond to the best score:

grid.best_params_

Output:

{'C': 10, 'max_iter': 1000}

Extracting best model

Fortunately, the grid instance provides a best_estimator_ attribute that returns the best model with the best parameters:

best_model = grid.best_estimator_ 
best_model

Output:

LogisticRegression(C=10, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

See that the C and max_iter values match grid.best_params_ above.
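As a quick check (my own addition), you can compare the refit model's parameters with grid.best_params_ directly:

# pull only the tuned hyperparameters out of the refit model
print({k: best_model.get_params()[k] for k in grid.best_params_})
# {'C': 10, 'max_iter': 1000}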

Now this is the part where I scratched my head the most. Recall that we did cross-validation to evaluate each model's score during the grid search, and recall that each split inside that cross-validation yields a different set of weights. Now, if we look at the weights in best_model:

print(best_model.coef_) 
print(best_model.intercept_)

Output:

[[ 4.87236201 0.38557485 -0.50462669 -0.01637815 -0.38553339 -0.78678517 -1.57447757 -1.0836287 -0.67538905 -0.03815671 -0.40786062 3.59780421 0.35464063 -0.18371018 -0.06740272 0.0985651 0.06214831 -0.11794136 -0.15798064 0.02367259 0.33852914 -0.71118748 0.04939336 -0.01150091 -0.84846646 -1.90486052 -3.16630724 -1.73183156 -2.50614262 -0.1584183 ]] [1.65736284]

Now the question is, which set of data produced these weights?

To answer that, we need to understand the role of refit=True when defining the grid instance (it is set to True by default). Here's a snippet from the documentation:

refit: bool, str, or callable, default=True
Refit an estimator using the best found parameters on the whole dataset.

So, roughly this is how I picture what is going on under the hood of grid.fit(X_train, y_train)

Under the Hood of grid.fit()
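In code, my rough mental model of grid.fit() with refit=True looks like the sketch below. This is not the actual scikit-learn implementation, just an approximation built from ParameterGrid and cross_val_score: score every combination with 3-fold CV, keep the best one, then refit a fresh model on the whole training set.

from sklearn.base import clone
from sklearn.model_selection import ParameterGrid, cross_val_score

best_cv_score, best_combo = -1, None
for combo in ParameterGrid(parameters):
    # mean 3-fold CV score for this hyperparameter combination
    score = cross_val_score(clone(model_no_tune).set_params(**combo),
                            X_train, y_train, cv=3).mean()
    if score > best_cv_score:
        best_cv_score, best_combo = score, combo

# refit=True: train one final model with the winning combination on the *whole* X_train
refit_model = clone(model_no_tune).set_params(**best_combo)
refit_model.fit(X_train, y_train)   # roughly what grid.best_estimator_ holds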

Sanity check

Hypothesis:

If the refit model was trained on the whole training data, then a new model with hyperparameters equal to those of best_model, fitted on the whole training data, will give the same weight and bias values; furthermore, it will give the same score when evaluated on the test data.

Let’s give it a try

First, define a new model named model_sanity and assign the best hyperparameter values to it, i.e. C = 10 and max_iter = 1000.

model_sanity = LogisticRegression() 
model_sanity.C = 10
model_sanity.max_iter = 1000

Fit the model to the whole training data:

model_sanity.fit(X_train, y_train)

Output:

LogisticRegression(C=10, class_weight=None, dual=False, 
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

Check whether coef_ and intercept_ values are the same

print (np.all(model_sanity.coef_ == best_model.coef_)) 
print (np.all(model_sanity.intercept_ == best_model.intercept_))

The code above should produce True in both cases

Moreover, we can check the scores of model_sanity and best_model on the test data; they should be equal.

print('model_sanity score: %.5f' % model_sanity.score(X_test, y_test))
print('best_model score: %.5f' % best_model.score(X_test, y_test))

Output:

model_sanity score: 0.96491 
best_model score: 0.96491

And of course, this score after hyperparameter tuning (0.96491) is better than the score of model_no_tune (0.95175).

And that's it! Now we know exactly what refit=True does in a GridSearchCV instance. Currently this is my go-to routine when tuning. It's worth noting that RandomizedSearchCV has the same methods and attributes, and they behave the same way as GridSearchCV's; a minimal sketch is shown below.
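For completeness, here is a minimal sketch of the same tuning with RandomizedSearchCV. The param_distributions dict, and my choice of a log-uniform distribution for C, are just an illustration, but refit=True, best_params_, best_score_ and best_estimator_ all behave exactly as described above.

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# sample C from a log-uniform distribution instead of a fixed grid
param_dist = {'C': loguniform(1e-2, 1e2), 'max_iter': [10, 100, 1000]}
rand = RandomizedSearchCV(model_no_tune, param_distributions=param_dist,
                          n_iter=9, cv=3, refit=True, random_state=42)
rand.fit(X_train, y_train)
print(rand.best_params_)
print(rand.best_score_)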

Finally, this is my first article that involves code snippets; constructive feedback and suggestions are always welcome! Have a good day, people!

Originally published at https://orvindemsy.github.io on October 3, 2020.


Orvin Demsy

I have a keen interest in machine learning, BCI, finance, math, and data.