If you’ve successfully built data models using Python and SciKit Learn, and you aren’t leveraging Pipelines and GridSearch, then I’m about to CHANGE YOUR LIFE! …Or at least that’s what I’d say if this were a clickbait article… But really, adding these two modeling objects to your code will improve readability and reproducibility, and decrease the room for error in your code. Each of the modeling techniques we’re going to focus on is powerful and useful in its own right, but when used in combination, magic happens.
First, a refresher
Before we dive into the main focus of this article, let’s take a moment and remind ourselves what these two SciKit Learn modules do on their own.
GridSearchCV
If you’re not using GridSearchCV when building a model, you’re likely running your model over and over, trying to find the best parameters, having to remember which ones you’ve tried and how each model scored. Or you’re repeating the code and running multiple models, cluttering up your coding environment. Either way, it’s not ideal. GridSearchCV takes the monotony out of finding the best parameters. You can instead use it to set the model you’re looking to optimize, pass in a list of parameter values you want to try, and then let it run each combination and store the best-performing model based on your specified scoring method!
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier()

grid_params = {'learning_rate': [0.05, 0.1, 0.2],
               'max_depth': [4, 7, 10],
               'subsample': [0.5, 0.7, 0.9],
               'min_child_weight': [1, 2, 5, 7]}

grid_search = GridSearchCV(xgb, param_grid=grid_params,
                           scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
y_hat = grid_search.predict(X_test)
The above block of code will run the classifier using each unique combination of parameters possible given grid_params, and the combination that produces the highest accuracy score will be stored in the grid_search object. Methods on grid_search can then be called to predict on holdout data. The .fit may take a few minutes since it runs the classifier many times over, but you don’t have to track how each individual attempt scored, or which combinations you’ve tried!
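Beyond best_params_, the fitted object also exposes the winning cross-validated score and a full log of every combination tried. A minimal, self-contained sketch (using synthetic data and LogisticRegression so it runs without xgboost installed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, just for illustration
X_train, y_train = make_classification(n_samples=200, random_state=0)

grid_search = GridSearchCV(LogisticRegression(max_iter=1000),
                           param_grid={'C': [0.1, 1.0, 10.0]},
                           scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)   # winning parameter combination
print(grid_search.best_score_)    # mean cross-validated accuracy of that combination
print(grid_search.cv_results_['mean_test_score'])  # scores for every combination tried
```

The cv_results_ dictionary is handy when you want to see how close the runners-up were, not just who won.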
Pipeline
For models that require preprocessing from modules that are either from SciKit Learn or packages that mirror its syntax, we can wrap our preprocessing modules together with our models and call one method to preprocess the data and then fit it to the model. Pipeline offers far deeper functionality on its own, but for the purposes of this refresher, let’s keep it simple.
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('xgb', XGBClassifier())])

pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)
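To see the whole flow end to end, here’s a self-contained sketch on synthetic data with some missing values punched in (LogisticRegression stands in for XGBClassifier so it runs on scikit-learn alone):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Punch a few holes in the data so KNNImputer has something to do
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('clf', LogisticRegression())])

pipeline.fit(X_train, y_train)     # one call: impute, scale, then fit
y_hat = pipeline.predict(X_test)   # the same preprocessing is applied to the test set
print(pipeline.score(X_test, y_test))
```

Note that the pipeline re-applies the imputer and scaler (fit only on the training data) to the test set automatically, which is exactly the discipline that prevents data leakage.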
Putting them together
Here’s where the magic happens! GridSearchCV and Pipeline play well together. You can make a Pipeline, then create a GridSearch parameter grid that tweaks parameters in any of the objects in the pipeline. The result: with very concise code, you can test for the best combination of parameters across all modules in your preprocessing-to-modeling pipeline, output predictions, and then score the model performance.
from sklearn.metrics import f1_score

pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('xgb', XGBClassifier())])

grid_params = {'kni__n_neighbors': [3, 6, 7, 10],
               'kni__weights': ['uniform', 'distance'],
               'xgb__learning_rate': [0.05, 0.1, 0.2],
               'xgb__max_depth': [4, 7, 10],
               'xgb__subsample': [0.5, 0.7, 0.9],
               'xgb__min_child_weight': [1, 2, 5, 7]}

grid_search = GridSearchCV(pipeline, param_grid=grid_params,
                           scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

y_hat = grid_search.predict(X_test)
f1 = f1_score(y_test, y_hat)
print(f'Model Score: {f1}')
To combine the two modules, we simply pass the pipeline variable as the first argument to GridSearchCV and modify our grid_params dictionary so that each key is prefixed with the name of its pipeline step followed by two underscores (e.g. xgb__max_depth targets the max_depth parameter of the step named 'xgb'). Then we fit, predict, and score the same way!
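If you’re ever unsure what the legal key names are, the pipeline will tell you. A quick sketch (again substituting LogisticRegression for XGBClassifier so it runs on scikit-learn alone):

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('clf', LogisticRegression())])

# Every key containing '__' is a legal entry for a GridSearchCV param_grid
for name in pipeline.get_params():
    if '__' in name:
        print(name)   # e.g. 'kni__n_neighbors', 'clf__C'
```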
Depending on how many parameters you’re testing, this may take some time. As always, it’s important to use both common sense and your statistical knowledge to decide which parameters are appropriate to put in your parameter grid, and to consider the trade-off between computational cost and model performance benefit.
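To get a feel for how quickly the cost grows, you can count the fits a grid implies before running it. Using scikit-learn’s ParameterGrid on a grid shaped like the one above (the key names here are illustrative; ParameterGrid only counts combinations, it doesn’t validate them):

```python
from sklearn.model_selection import ParameterGrid

grid_params = {'kni__n_neighbors': [3, 6, 7, 10],
               'kni__weights': ['uniform', 'distance'],
               'xgb__learning_rate': [0.05, 0.1, 0.2],
               'xgb__max_depth': [4, 7, 10],
               'xgb__subsample': [0.5, 0.7, 0.9],
               'xgb__min_child_weight': [1, 2, 5, 7]}

n_combinations = len(ParameterGrid(grid_params))  # 4 * 2 * 3 * 3 * 3 * 4
cv = 3
print(n_combinations)        # 864 unique parameter combinations
print(n_combinations * cv)   # 2592 model fits in total
```

Seeing “2592 fits” in black and white is a good nudge to prune the grid, or to reach for RandomizedSearchCV when the space gets large.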
Hopefully this has you excited and dreaming up the possible uses of these modules. Model on!