Hyperparameter Tuning Workflow with scikit-learn and Optuna

Kaggle — Child Mind Institute Detect Sleep States Dataset.

6 min read · Oct 18, 2023


Photo by Jacek Ulinski on Unsplash

Introduction

In this post, I create a machine learning workflow that shows how raw input data can be ingested, transformed, and optimized into a useful model. The workflow is somewhat unusual in that it is geared toward one specific technique: hyperparameter tuning. For data manipulation and transformation I use Python functions built on pandas, scikit-learn, and Optuna; SQLite handles data storage, and Optuna-dashboard handles visualization. The dataset is the Child Mind Institute (CMI) Detect Sleep States dataset from Kaggle, which is composed of training and validation data in CSV and Parquet file formats. According to the data description:

The dataset comprises about 500 multi-day recordings of wrist-worn accelerometer data annotated with two event types: onset, the beginning of sleep, and wakeup, the end of sleep. Your task is to detect the occurrence of these two events in the accelerometer series.

Additional detail on the fields and how the data is defined and collected can be found on the data description page; below is a synopsis of the files provided:

sample_submission.csv: [209 B] contains example values for row_id, series_id, step, event (target variable), and score.

test_series.parquet: [4.59 kB] contains values for series_id, step, timestamp, anglez, and enmo. Used to generate predictions for submission.

train_events.csv: [635.3 kB] sleep log data, contains values for series_id, night, event, step, and timestamp.

train_series.parquet: [985.82 MB] accelerometer data, contains values for series_id, step, timestamp, anglez, and enmo.

In addition to the competition's dataset, I include a couple of reduced datasets provided by an excellent Kaggler. Their smaller size speeds up development of the workflow and prevents out-of-memory errors that can occur with the much larger training datasets.
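
As a small illustration of the ingestion step (the file names and columns come from the data description above; the inspection code itself is my own sketch, not from the original post), the files can be loaded and examined with pandas:

import pandas as pd

# Accelerometer signal (Parquet) and sleep-log annotations (CSV).
series = pd.read_parquet("train_series.parquet")
events = pd.read_csv("train_events.csv")

print(series.columns.tolist())   # expected: ['series_id', 'step', 'timestamp', 'anglez', 'enmo']
print(events.columns.tolist())   # expected: ['series_id', 'night', 'event', 'step', 'timestamp']
print(events["event"].unique())  # expected: 'onset' and 'wakeup' per the data description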

The accelerometer training data provides the fields for creating features (X variables), and the sleep log data supplies the target values (y) in the event column. The goal is to create a model that predicts the event type, either onset or wakeup (or neither), for the test series data. The actual submission only contains events classified as onset or wakeup along with their associated prediction scores, so this event detection problem can be modeled as supervised multiclass classification of time series data.

Objectives:

  • Design a hyperparameter optimization workflow for the CMI Sleep Detection.
  • Optimize the hyperparameters from an example Random Forest Classifier.
  • Evaluate Optuna dashboard trial run results and the hyperparameter plots.
Hyperparameter Tuning Workflow Diagram (by Author). Left to right: Kaggle, Parquet/CSV, scikit-learn, Optuna, SQLite, and Optuna-dashboard.

Workflow Development

Hyperparameter optimization (HPO), a.k.a. hyperparameter tuning, is the process of adjusting a machine learning model’s hyperparameters to optimal values. A model’s hyperparameters differ from its parameters in that hyperparameters are set by the user, while parameters are learned by the model from the data it is given. For example, scikit-learn’s implementation of the K-Nearest Neighbors Classifier exposes the hyperparameter k (n_neighbors), while the algorithm derives its other attributes from the data it is fitted on.
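
As a quick illustration (my own sketch, not from the original post), the snippet below fixes the hyperparameter n_neighbors by hand, while attributes such as classes_ are only populated once the model has seen the data:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter: chosen by the user before training.
knn = KNeighborsClassifier(n_neighbors=5)

# Anything the estimator derives from the data is only available after fit().
knn.fit(X, y)
print(knn.classes_)  # learned from y, not supplied by the user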

HPO Strategies: Brute-force vs Optimization-algorithms

A well-known strategy for tuning hyperparameters is the brute-force approach, in which an exhaustive search is performed to find the optimal hyperparameter(s). Grid search is the traditional way of implementing this approach. Its drawback is that it is resource-intensive, consuming a lot of compute and time, and generally inefficient. An alternative strategy is to explore the hyperparameter search space intelligently using optimization algorithms, e.g. Bayesian optimization. The speed and efficiency benefits become clearer for larger datasets.
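
For comparison, here is a hedged sketch of what the brute-force approach would look like with scikit-learn's GridSearchCV over the same two Random Forest hyperparameters used later in this post; every combination in param_grid is evaluated, regardless of how promising it looks (X_train and y_train are assumed to be the preprocessed training data):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Exhaustive search: 3 x 3 = 9 candidate combinations, each fitted cv=3 times.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [2, 5, 10],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=21),
    param_grid=param_grid,
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)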

Reading the above workflow diagram from left to right, it starts with the ingestion of the datasets described in the introduction. Next, using Python libraries, I perform data preprocessing and feature engineering (cleaning, standardization, normalization, transformations). These are crucial steps for getting the data into the right format for the model and for creating features that produce better results. I then fit the model, in this case a Random Forest (RF) Classifier, on the processed training data. Next, I generate predictions and evaluate (i.e. score) those predictions.

To tune the hyperparameters, I define an Optuna objective function that is used to run a study. The objective function essentially performs all of the steps above to generate predictions and scores, with the key addition that it iterates over trials to optimize the score. The storage (SQLite) and dashboard (Optuna-dashboard) are used to persist and visualize the optimization results, which helps in improving the model score.
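
To make the preprocessing and feature-engineering step more concrete, below is a hypothetical sketch that reuses the series and events frames loaded earlier; the rolling-window features and the catch-all "no_event" class are my own illustrative choices, not necessarily what the original workflow does:

# Illustrative rolling-window statistics over the accelerometer signals.
for col in ["anglez", "enmo"]:
    grouped = series.groupby("series_id")[col]
    series[f"{col}_rolling_mean"] = grouped.transform(lambda s: s.rolling(60, min_periods=1).mean())
    series[f"{col}_rolling_std"] = grouped.transform(lambda s: s.rolling(60, min_periods=1).std())

# Attach the target: steps annotated in the sleep log keep their event type
# (onset/wakeup); every other step falls into a third "no_event" class.
events_clean = events.dropna(subset=["step"]).astype({"step": "int64"})
labeled = series.merge(
    events_clean[["series_id", "step", "event"]],
    on=["series_id", "step"], how="left",
)
labeled["event"] = labeled["event"].fillna("no_event")

X = labeled[[c for c in labeled.columns if c.endswith(("_mean", "_std"))]]
y = labeled["event"]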

The optimization function I created leverages the Optuna library and is essentially a wrapper around the RF classifier algorithm. First, I define an objective function and a study with associated SQLite storage. Then I define the hyperparameters to optimize:

n_estimators — the number of trees inside the classifier, scikit-learn default is 100.

max_depth — the maximum depth to which the trees in the forest can grow, scikit-learn default is None.

I also define the suggested search space for the HPO study. Lastly, I create the study with its direction set to maximize and call its optimize function with the number of trials. By default, under the hood, Optuna uses the Tree-structured Parzen Estimator (TPE) algorithm, a type of Bayesian optimizer. When the study runs, it saves the optimization history from the trials into a local SQLite database, which is then read and visualized by Optuna-dashboard.

import optuna
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train, y_train, X_test, y_test and the competition's event-detection
# scoring helpers (score, tolerances, column_names) are defined earlier.

def objective(trial):
    """
    Define the objective function to be maximized.
    Returns the accuracy score.
    """
    # hyperparameter search space

    # rf model hyperparameters
    n_estimators = trial.suggest_int("n_estimators", 10, 100)
    max_depth = trial.suggest_int("max_depth", 2, 10, log=True)

    # rf model training
    classifier = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=21
    )
    classifier.fit(X_train, y_train)

    # generate predictions & evaluate
    y_pred = classifier.predict(X_test)
    accuracy = score(y_test, y_pred, tolerances, **column_names)

    return accuracy

# Execute study object & persist trial runs.
study_name = "cmi-study"
storage_name = "sqlite:///cmi-sleep.db"
study = optuna.create_study(
    study_name=study_name, storage=storage_name, load_if_exists=True, direction="maximize"
)
study.optimize(objective, n_trials=10)
Optuna objective study function console output (by Author)

The visualizations are created with Optuna’s dashboard. Running the command optuna-dashboard <path_to_sqlite_file> starts a local instance of the dashboard on localhost, default port 8080. From the dashboard I can view plots of the results and the trial history. Of particular note is the Hyperparameter Importance plot, which, based on the trial results, shows a relative comparison of each hyperparameter’s significance in generating the results. This insight can assist in selecting features and deciding which hyperparameters to focus on optimizing. Other useful information in the dashboard includes the History plot, the Timeline plot, and the Best Trial result, which shows the hyperparameter values that yielded the best scores.
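
The same information shown in the dashboard can also be pulled programmatically from the stored study; the short sketch below (my addition, reusing the study name and SQLite file from the code above) prints the best trial and the relative hyperparameter importances:

import optuna

# Reload the persisted study from the SQLite storage used above.
study = optuna.load_study(study_name="cmi-study", storage="sqlite:///cmi-sleep.db")

print("Best value:", study.best_value)
print("Best hyperparameters:", study.best_params)

# Relative importances, mirroring the dashboard's Hyperparameter Importance plot.
print(optuna.importance.get_param_importances(study))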

Hyperparameter Importance plot (by Author) & Optuna Dashboard with trial histories (by Optuna)

Conclusion

In this post, I have shown how to implement a less naive approach to hyperparameter optimization by applying the Optuna framework. Using an optimization framework such as Optuna is advantageous for larger datasets compared to GridSearch and RandomSearch. I did not discuss the framework’s other features, such as pruning strategies and parallelization, although these can greatly decrease optimization duration. Using the Random Forest Classifier, I have depicted a workflow for improving a model’s performance by tuning its hyperparameters. In comparison to the other steps, i.e. feature engineering and preprocessing, hyperparameter tuning is just the tip of the iceberg, or as the adage goes: premature optimization is the root of all evil. But when used correctly, hyperparameter tuning can yield good gains. By optimizing the inputs, the tuning workflow leads to incrementally better scores and predictions.

References

Optuna Documentation

Optuna Dashboard Documentation

Python Optuna: A Guide to Hyperparameter Optimization

Beyond Grid Search: XGBoost and Optuna as the ultimate ML Optimization Combo — William Arias

Scikit-Learn Ensemble Random Forest Classifier

Thanks for reading! If you want to get in touch with me, feel free to reach me on my LinkedIn Profile. You can also view some code in my GitHub.

Written by Bayo Adejare

Data Engineer — Building the Modern Data Stack, byte by byte. All views are mine.