Understanding GridSearchCV/RandomizedSearchCV's refit=True
In this article I will share my experience with implementing hyperparameter tuning using scikit-learn's GridSearchCV and RandomizedSearchCV. Concretely, I will focus on the refit=True
parameter, as that is the one that confused me the most when I first used GridSearchCV/RandomizedSearchCV. I hope this article helps you build a better understanding of that parameter.
Prerequisites
First off, a basic understanding of the following topics is expected:
- K-Fold CV
- Training a model
- Hyperparameters
I will explain them briefly here, but I encourage you to read up on them yourself.
Note
Since the focus of this article is as stated above, the following issues will be ignored during data processing:
- Un-normalized feature values
- The convergence warning caused by un-normalized values
If you want to follow along, I suggest suppressing all warnings by executing:
import warnings
warnings.filterwarnings('ignore')
Problem formulation
The objective is as follows:
find the hyperparameter values of a logistic regression model that give the best accuracy for binary classification.
Load dataset
Let’s use the breast cancer dataset that is readily available in the scikit-learn package.
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

dataset = load_breast_cancer()
dataset = pd.DataFrame(np.c_[dataset.data, dataset.target],
                       columns=list(dataset.feature_names) + ['target'])
dataset.head()
Let’s peek at the dataset to see how many features there are, and also how many samples/instances are available in each class.
dataset.groupby('target').size()
Output:
target
0.0    212
1.0    357
dtype: int64
So far, we know the dataset contains 569 samples with 30 features; 212 of them belong to class ‘0’ and 357 to class ‘1’. Let’s continue and split the data into train and test sets.
Split into train and test set
X_all = dataset.iloc[:, :-1]
y_all = dataset.iloc[:, -1]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.6, random_state=42)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
Output:
X_train shape: (341, 30)
y_train shape: (341,)
X_test shape: (228, 30)
y_test shape: (228,)
Okay, the data preprocessing steps are now done. We are ready for the classification stage; first we need to build and fit our model on the training data.
Possible FAQ so far:
Why set random_state to 42?
> You can set this number to any integer you want; the purpose of setting random_state is reproducibility. The trivial reason why 42 is often used can be found here.
Why set train_size to 0.6?
> Usually, you want to split your data by a ratio of 7:3 or 8:2 (train:test). The objective of this work is different, so train_size=0.6 is used for explanatory purposes.
Training the model
Load the Logistic Regression, define a model, then train the model with our training data
from sklearn.linear_model import LogisticRegression

model_no_tune = LogisticRegression(max_iter=10)
model_no_tune.fit(X_train, y_train)
Output:
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=10, multi_class='auto',
n_jobs=None, penalty='l2',random_state=None,
solver='lbfgs', tol=0.0001,
verbose=0,warm_start=False)
The lengthy content inside the parentheses following LogisticRegression shows the model’s default parameters; some of them are hyperparameters whose values we can set as we wish. As an example, I set the max_iter value to 10.
For this article let’s focus on two hyperparameters, namely C and max_iter. We'll find the best values for these in a later section.
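As a side note, if you want to see every hyperparameter the model exposes together with its current value, one convenient way is to call get_params() on the estimator:
# Dictionary of all hyperparameters and their current values
print(model_no_tune.get_params())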
Then let’s see how the model performs on the test data
model_no_tune.score(X_test, y_test)
Output:
0.9517543859649122
Good, 0.95 is not bad at all. But is this really the best the model can give? Nope. Remember that we can tune the hyperparameters, but first let’s do a quick review of what happens when a model is trained.
What happens when we train the model?
This is where I expect you to have some knowledge of what happens when we train a model; knowledge of logistic regression might be helpful too.
Remember that when we train the model, we feed it our training data (X_train) and the target (y_train). The model learns the n weights/parameters (usually denoted by θ) that best fit those data, where n is determined by the number of features. So, if there are 30 features in our data there will be 30 weights in our model. Let's confirm there are 30 weights in our model.
model_no_tune.coef_
Output:
array([[ 2.89479065e-03, 5.25384007e-03, 1.76541312e-02, 2.17186403e-02, 3.03361474e-05, 9.22071112e-06, -1.94952117e-05, -9.80070124e-06, 5.62485250e-05, 2.28559189e-05, 1.53327500e-05, 4.05040997e-04, 9.31758108e-05, -6.20673739e-03, 2.54362610e-06, 3.87518114e-06, 3.89117211e-06, 1.87377855e-06, 6.56052693e-06, 1.17953472e-06, 2.78523885e-03, 6.59304644e-03, 1.69345600e-02, -2.10331333e-02, 3.90652953e-05, 6.69649802e-06, -2.94063588e-05, -5.77885791e-06, 7.76807859e-05, 2.53727980e-05]])
Are there 30 of them?
model_no_tune.coef_.shape
Output
(1, 30)
Yes, we can see there are 30 weights/parameters in the model. There is also a bias value that we can see by calling
model_no_tune.intercept_
Output:
array([0.00036287])
I need you to pay attention to these weights, because a different set of training data or a different set of hyperparameter values will give different weight values.
K-Fold Cross Validation Review
In k-fold CV we split the data into k folds, then train the model on k-1 folds (equivalent to training data) and evaluate the score on the remaining fold (equivalent to test data). Notice that this means each split should produce a different set of model weights/parameters. Here’s an illustration of 3-fold cross-validation.
The model weights of set 1 are not equal to those of set 2 or 3, and vice versa. The takeaway is that the model weight/parameter values depend on which training data and hyperparameter values the model is trained on.
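To make this concrete, here is a minimal sketch (reusing the X_train and y_train defined earlier, and the same max_iter=10 setting) that fits one model per fold of a 3-fold split and prints a few of its weights; the values differ from fold to fold:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

kf = KFold(n_splits=3)
for i, (fit_idx, val_idx) in enumerate(kf.split(X_train), start=1):
    # Train on two folds; the third fold (val_idx) would be used for scoring
    fold_model = LogisticRegression(max_iter=10)
    fold_model.fit(X_train.iloc[fit_idx], y_train.iloc[fit_idx])
    print('fold %d, first 3 weights:' % i, fold_model.coef_[0, :3])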
Training with GridSearchCV
Now it’s time to set up our hyperparameter values. By convention, you create a dictionary whose keys correspond to the model’s hyperparameters; recall that we want to adjust the C and max_iter values of model_no_tune.
from sklearn.model_selection import GridSearchCV

parameters = {'C': [0.1, 1, 10],
              'max_iter': [10, 100, 1000]}
GridSearchCV will set up pairs of the parameters defined in the dictionary and use them as model parameters; in this example there will be 9 pairs (all combinations of C and max_iter).
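If you want to list those combinations yourself (purely for illustration; GridSearchCV does this internally), you can enumerate the same dictionary with ParameterGrid:
from sklearn.model_selection import ParameterGrid

# Each entry is one (C, max_iter) combination that GridSearchCV will evaluate
for params in ParameterGrid(parameters):
    print(params)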
For further information about how GridSearchCV works please refer to the documentation.
In total there will be 9 models, each with a different combination of the C and max_iter values defined above, and each model will be trained on the data fed into it.
To train with GridSearchCV we need to create a GridSearchCV instance and define the number of cross-validation folds (cv) we want; here we set cv=3.
grid = GridSearchCV(estimator=model_no_tune, param_grid=parameters, cv=3, refit=True)
grid.fit(X_train, y_train)
Let’s take a look at the results
You can check for yourself that grid.cv_results_ also includes information about the time required to process the data; we will ignore the time-related information and just look at the scores. By calling pd.DataFrame(grid.cv_results_) we can view the results as a table.
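For a more focused view, you can select just the parameter and score columns (these column names are the ones scikit-learn produces in cv_results_):
results = pd.DataFrame(grid.cv_results_)
print(results[['param_C', 'param_max_iter', 'mean_test_score', 'rank_test_score']])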
Notice that there are 9 rows; each row represents a model with different hyperparameter values. You can also infer which model performs best by looking at mean_test_score, which should correspond to rank_test_score.
Alternatively, we can call grid.best_score_ to see the best score; this gives the best mean_test_score (i.e. 1st place in rank_test_score).
grid.best_score_
Output:
0.952957615277131
Plus, you can see the best parameters that correspond to the best score
grid.best_params_
Output:
{'C': 10, 'max_iter': 1000}
Extracting best model
Fortunately, the grid instance provides a best_estimator_ attribute that returns the best model with the best parameters
best_model = grid.best_estimator_
best_model
Output:
LogisticRegression(C=10, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
See that the C and max_iter values match grid.best_params_ above.
Now this is the part where I scratched my head the most. Recall that we did cross-validation to evaluate each model’s score during the grid search, and recall that each split inside that cross-validation yields a different set of weights. Now, if we look at the weights of best_model:
print(best_model.coef_)
print(best_model.intercept_)
Output:
[[ 4.87236201 0.38557485 -0.50462669 -0.01637815 -0.38553339 -0.78678517 -1.57447757 -1.0836287 -0.67538905 -0.03815671 -0.40786062 3.59780421 0.35464063 -0.18371018 -0.06740272 0.0985651 0.06214831 -0.11794136 -0.15798064 0.02367259 0.33852914 -0.71118748 0.04939336 -0.01150091 -0.84846646 -1.90486052 -3.16630724 -1.73183156 -2.50614262 -0.1584183 ]]
[1.65736284]
Now the question is: which set of data produced these weights?
To answer that we need to understand the role of refit=True when defining the grid instance (it is set to True by default). Here's a snippet from the documentation:
refit: bool, str, or callable, default=True
Refit an estimator using the best found parameters on the whole dataset.
So, roughly, this is how I picture what is going on under the hood of grid.fit(X_train, y_train):
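In code form, here is a rough sketch of that picture; this is my own approximation using ParameterGrid and cross_val_score, not scikit-learn's actual implementation:
from sklearn.model_selection import ParameterGrid, cross_val_score

best_score, best_params = -1.0, None
for params in ParameterGrid(parameters):
    # Score each (C, max_iter) combination with 3-fold CV on the training data
    candidate = LogisticRegression(**params)
    mean_score = cross_val_score(candidate, X_train, y_train, cv=3).mean()
    if mean_score > best_score:
        best_score, best_params = mean_score, params

# The refit=True step: retrain one final model with the winning
# parameters on the WHOLE training set
refitted_model = LogisticRegression(**best_params).fit(X_train, y_train)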
Sanity check
Hypothesis:
If the model was trained on the whole training data, then a new model with hyperparameters equal to those of best_model will give the same weight and bias values; furthermore, it will give the same score when evaluated on the test data.
Let’s give it a try
First, define a new model named model_sanity and assign the best hyperparameter values to it, i.e. C = 10 and max_iter = 1000
model_sanity = LogisticRegression()
model_sanity.C = 10
model_sanity.max_iter = 1000
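Equivalently, since grid has already been fitted, the same model can be built more compactly by unpacking the parameters found by the search; this is just an alternative to setting the attributes by hand:
# Same hyperparameters as above, taken directly from the search results
model_sanity = LogisticRegression(**grid.best_params_)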
Fit the model on the whole training data
model_sanity.fit(X_train, y_train)
Output:
LogisticRegression(C=10, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000, multi_class='auto',
n_jobs=None, penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Check whether the coef_ and intercept_ values are the same
print (np.all(model_sanity.coef_ == best_model.coef_))
print (np.all(model_sanity.intercept_ == best_model.intercept_))
The code above should produce True in both cases
Moreover, we can check the scores of model_sanity and best_model on the test data; they should be equal
print('model_sanity score: %.5f' % model_sanity.score(X_test, y_test))
print('best_model score: %.5f' % best_model.score(X_test, y_test))
Output:
model_sanity score: 0.96491
best_model score: 0.96491
And of course this score after hyperparameter tuning (0.96491) is better than the test score of model_no_tune (0.95175).
And that’s it! Now we know exactly what refit=True does in a GridSearchCV instance. Currently this is my go-to routine when tuning. It's worth noting that RandomizedSearchCV has the same methods and attributes, which behave the same way as in GridSearchCV.
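For completeness, here is a minimal RandomizedSearchCV version of the same search; the n_iter=5 below is an arbitrary choice of mine, and it samples 5 of the 9 combinations instead of trying them all:
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(estimator=model_no_tune,
                                   param_distributions=parameters,
                                   n_iter=5, cv=3, refit=True, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)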
Finally, this is my first article that includes code snippets, so constructive feedback and suggestions are always welcome! Have a good day, people!
Originally published at https://orvindemsy.github.io on October 3, 2020.