Python / Data Science / Modeling / Tutorial

Leveraging Pipeline and Gridsearch

Joe Marx
4 min read · Jan 14, 2021

Streamline your model optimization by combining these two powerful modules

If you’ve successfully built data models using Python and SciKit Learn, and you aren’t leveraging Pipelines and GridSearch, then I’m about to CHANGE YOUR LIFE! …Or at least that’s what I’d say if this were a clickbait article… But really, adding these two modeling objects to your code will improve readability and reproducibility, and reduce the room for error in your code. Each of the modeling techniques we’re going to focus on is powerful and useful in its own right, but when used in combination, magic happens.

First, a refresher

Before we dive into the main focus of this article, let’s take a moment and remind ourselves what these two SciKit Learn modules do on their own.

You, running your model over and over again looking for the best parameters without GridSearchCV

GridSearchCV

If you’re not using GridSearchCV when building a model, you’re likely running your model over and over, trying to find the best parameters, having to remember which ones you’ve tried, and having to remember how each model scored. OR you’re repeating the code and running multiple models, cluttering up your coding environment. Either way, it’s not ideal. GridSearchCV takes the monotony out of finding the best parameters. You simply specify the model you’re looking to optimize, pass in a grid of parameter values you want to try, and let GridSearchCV run each combination and store the best-performing model based on your specified scoring method!

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
xgb = XGBClassifier()
grid_params = {'learning_rate': [0.05, 0.1, 0.2],
               'max_depth': [4, 7, 10],
               'subsample': [0.5, 0.7, 0.9],
               'min_child_weight': [1, 2, 5, 7]}
grid_search = GridSearchCV(xgb, param_grid=grid_params,
                           scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
y_hat = grid_search.predict(X_test)

The above block of code will run the classifier using every unique combination of parameters possible given grid_params, and the combination that produces the highest accuracy score will be stored in the grid_search object. Methods on grid_search can then be called to predict on holdout data. The .fit may take a few minutes to run since it’s fitting the classifier many times over, but you don’t have to track how each individual attempt scored, or which combinations you’ve tried!
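That said, the fitted grid_search object does keep a record of every attempt if you ever want to review it. As a quick sketch (assuming the fit above has already run and that pandas is available), you can pull the best cross-validated score and a ranked table of every combination tried:

import pandas as pd

# Best mean cross-validated accuracy found by the search
print(grid_search.best_score_)

# Every parameter combination tried, with its mean and std of test scores across folds
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())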

Pipeline

For models that require preprocessing with modules from SciKit Learn (or from packages that mirror its syntax), we can chain our preprocessing steps together with our model and call a single method to preprocess the data and then fit it to the model. Pipeline can do far more on its own, but for the purposes of this refresher, let’s keep it simple.

from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('xgb', XGBClassifier())])
pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)

Putting them together

Here’s where the magic happens! GridSearchCV and Pipeline play well together. You can build a Pipeline, then create a GridSearch parameter grid that tweaks parameters in any of the objects in the pipeline. The result: with very concise code, you can test for the best combination of parameters across every module in your preprocessing-to-modeling pipeline, output predictions, and then score the model’s performance.

from sklearn.metrics import f1_score

pipeline = Pipeline([('kni', KNNImputer()),
                     ('ss', StandardScaler()),
                     ('xgb', XGBClassifier())])
grid_params = {'kni__n_neighbors': [3, 6, 7, 10],
               'kni__weights': ['distance', 'uniform'],
               'xgb__learning_rate': [0.05, 0.1, 0.2],
               'xgb__max_depth': [4, 7, 10],
               'xgb__subsample': [0.5, 0.7, 0.9],
               'xgb__min_child_weight': [1, 2, 5, 7]}
grid_search = GridSearchCV(pipeline, param_grid=grid_params,
                           scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
y_hat = grid_search.predict(X_test)
f1 = f1_score(y_test, y_hat)
print(f'Model Score: {f1}')

To combine the two modules, we simply pass the pipeline object as the first argument to GridSearchCV and modify our grid_params dictionary so that each parameter name is prefixed with the name of its pipeline step followed by two underscores. Then we fit, predict, and score the same way!
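If you’re ever unsure what the exact step__parameter names are, one quick check (just a sketch, using the pipeline object defined above) is to ask the pipeline itself:

# Lists every tunable parameter in 'step__parameter' form
for name in pipeline.get_params().keys():
    print(name)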

Depending on how many parameters you’re testing, this may take some time. As always, it’s important to use both common sense and your statistical knowledge to decide which parameters belong in your parameter grid, and to weigh the computational cost against the benefit to model performance.
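To get a feel for that cost before you hit run, you can count the fits the search will perform: it’s the product of the lengths of the value lists in your grid, multiplied by the number of cross-validation folds. A rough sketch using the grid_params defined above:

from math import prod

# One fit per parameter combination per cross-validation fold
n_candidates = prod(len(values) for values in grid_params.values())
n_folds = 3
print(f'{n_candidates} combinations x {n_folds} folds = {n_candidates * n_folds} fits')

With the grid above, that works out to 864 combinations and 2,592 fits, which is a useful reality check before committing to an even larger grid.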

Hopefully this has you excited and dreaming up the possible uses of these modules. Model on!

