How to get started with Hyper-parameter Optimization

Ege Hosgungor

11 min readDec 31, 2021

Fasten your seat belts cause traveling through **hyperspace** ain’t like dusting crops 😎 (Got the Han Solo reference 😏 ? No ? okay…)

Any end-to-end machine learning project can be divided into several stages:

Get the Data
Exploratory Data Analysis and Visualization
Preprocessing and Cleaning the Data
Select and train a model
Hyper-parameter Optimization (HPO)
Present your solution
Launch, monitor, and maintain your system.

In this post, I will walk you through the following hyper-parameter optimization approaches in details:

Introduction

Let’s assume you understood your data well, preprocessed & made it ready for training. At this stage in a machine learning project you have some shortlist of promising models (Neural Network, Random-Forest, SVM, XgBoost) and you would like to optimize their hyper-parameters in order to see which model would be the best for your task. But first, what is hyper-parameters ?

A hyper-parameter is a parameter whose value is used to control the traning process by contrast with model parameters which are learned during the process. Another words the difference between hyper-parameter optimization of a model and training a model is as follows:

Model parameters are learned during model training. For example weights in a neural network changes in a training with every iteration.

Hyper-parameters are independent variables from the training. These parameters are specific to the algorithm that has been chosen and users can arbitrarily set these parameters. (eg. learning rate of a neural network, number of estimators in a Random Forest).

Hyper-parameter optimization happens on top of model training and this optimization is needed in order to get the best result out of a model. In the next chapter we will see different approaches to optimize these parameters.

Hyper-parameter Optimization Approaches

In this chapter I explained most of the methods in detail from simplest to the edge of technology ones, how they work and where to use them. Since there are many practical posts about how to code these approaches in python, I felt like it is better to give a high-level overview of HPO. I also recommend watching Andrew Ng’s amazing content about Tuning Process.

Manual Search

One option for optimizing your model’s hyper-parameters would be manual search. Trying different values till you find a great combination of them. This is mostly likely to be failed and will be counted as a self-torture if you try to do it with multiple parameters. Why ?

Because it is not a linear process to understand which value is better for a certain parameter so you cannot iterate it easily. If you have multiple parameters than it’s even much more complex space that you need to scan. Without a decent strategy manual search is literally looking for a needle in a haystack.Data Scientists like automation. The whole purpose of teaching models how to behave in a certain ways is to get rid of element of repetitive work.

Parallel Search

The simplest automation that you can do is a parallel search. The idea is training number of different models on parallel and comparing their performance with each other instead of guessing what should try next. This is called grid search. When you find a range of values that works for your model, you can zoom in to that range and try different values in a smaller space. This can be an iterative process where you tune your parameters till they don’t affect the performance anymore. This only requires the time for one training run, but requires the use of more computational resources to train many models in parallel.

A better version of grid search is called random search where you tweak the values a little bit each time randomly so you don’t try the exactly same values twice. This works better than grid search. On the picture below the green area with the peak shows the best combination of these two features. As can be seen random search’s best trial is much closer to the peak than the best trial of grid search. This video of Andrew Ng might be helpful to understand better how to choose these random values.

With Grid Search, nine trials only test three distinct places. With Random Search, all nine trails explore distinct values.

Why we are not satisfied with this technique ? Let me give you an example:

If we want to try 3 different values for parameter_1 then we can try all of them and compare their results. If we want to try 3 different values for parameter_1 and 3 different values for parameter_2 then we will have 3x3 = 9 different trials. I think you got the idea, the number of trial counts will exponentially increase with number of parameters and values.Trying all of them one by one is not efficient. (Although parallel search is the most common hpo approach because of its simplicity.)

Sequential Search

Till now we didn’t use any mathematical approach, we were just up all night to get lucky (pun intended), but there are better ways. As we just mentioned the parallel search was inefficient because you need to try all of the values, instead we should be more clever and train the least number of models while we look for the best hyper-parameter combination.

Sequential search is a way to train a model at each step, using a specific set of parameters. The algorithm uses the information from all the previous steps to define the set of parameters that will be tested in the next step/iteration. By doing this, the algorithm focuses the search on the most promising parameters.

Bayesian optimization is a global optimization method for any black-box system.(a system for which we can only observe the inputs and outputs, but not the internal workings itself) In the context of hyper-parameter optimization of machine learning models, Bayesian optimization is of particular relevance as the computational costs for evaluating model variations is high and often objective function is complex, does not have accessible derivatives, or when the problem at hand is non-convex. (objective function is the function that you want to minimize or maximize in your problem. e.g minimizing of cost function, maximizing reward function)

In contrast of parallel search methods, Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.

Surrogate function: It is an approximation of the objective function and can be used to estimate the cost of different candidate samples that we may want to evaluate.

The surrogate is much easier to optimize than the objective function and Bayesian methods work by finding the next set of hyper-parameters to evaluate on the actual objective function by selecting hyper-parameters that perform best on the surrogate function. The sequential search gives you the advantage of

Reduced running time of hyper-parameter optimization
Better scores on the testing set

Gradient Descent: For specific algorithms (neural networks, SVMs and logistic regression etc.) it is possible to compute the gradient with respect to hyper-parameters and then optimize the hyper-parameters using gradient descent. The experimental results show that the gradient-based algorithms are also significantly more effective than parallel search-based algorithms when the number of hyper-parameters is large. Being applicable on neural networks make gradient descent algorithms a popular choice in deep learning.

Evolutionary Search

The evolutionary search consist of genetics algorithms make use of basic concepts from evolution by natural selection. Genetic algorithms have wide use cases just like Bayesian Theorem however they are quite overlooked in the context of hyper-parameter optimization.

I always get fascinated by 2D Box Car simulations where you can understand how genetic algorithms works in a much more practical sense. The simulation have ground rules like every car will have 2 wheels, they must be connected by a chassis etc. Then there are some parameters such as size of the wheels, shape of the chassis. These are the parameters that we would like to optimize for the best car. The simulation starts with random population and by the end of each run best candidates transfer their genes (parameters) to the next generation. Random parameter mutations are introduced in the population, and cars with a higher fitness score outlive others.

Genetic evolution of a wheeled vehicle with Box2d

We can use this natural selection concept for hyper-parameter optimization of our machine learning models. Lets say we have a population of N Machine Learning models with some predefined parameters. We can then train these models and decide to keep half of the models with the best score. From these N/2 models we can generate new population of N ML models with similar hyper-parameters. I’m saying similar not exact because than whats the point of generating a new population. One flaw that genetic algorithms has that as mutations are purely random the consequences are also random just as evolution. I would again recommend reading this amazing blog post for more.

Population Based Search

The Population Based Training (PBT) is an ML hyper-parameter optimization method developed by DeepMind. This method combines both parallel and sequential search methods I will tell you how:

The diagram above shows how PBT works. Training starts like parallel search, randomly sampling hyper-parameters and weight initializations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyper-parameters by modifying the better model’s hyper-parameters, before training is continued. This process allows hyper-parameters to be optimized online, and the computational resources to be focused on the hyper-parameter and weight space that has most chance of producing good results. The result is a hyper-parameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions.

It is most analogous to evolutionary search but rather than evolving parameters, PBT train models partially with standard optimization methods such as gradient descent. also the main purpose of PBT is to provide a way to optimize both model parameters θ and the hyper-parameters h jointly on the actual metric Q that we care about.

A Reward/Step Graph. PBT on a reinforcement learning task. 6 different samples running on parallel.

Here is a graph from an actual reinforcement learning task running with Population Based Training. Although each model is getting different rewards, overall all of them are getting improved. This is because periodically these models are sharing their parameters both model and hyper-parameters.

This combined training of a ML model can be used for any task but its especially popular within the reinforcement learning (RL) community. RL algorithms are known to be brittle to the choice of hyper-parameters therefore starting with an unlucky initialization may have an absolute effect on the end product since bad strategy means bad data generation. PBT increases the chance of success by having multiple simulations and strategies.

In 2020 Anyscale (a company known for their Python library Ray) published a new method similar to PBT called Population Based Bandits (PB2). Similar PBT, PB2 trains and optimizes a series of networks at the same time, allowing the optimal set-up to be quickly found. However, unlike PBT, PB2 uses a probabilistic model to guide the search in an efficient way, making it possible to discover high performing hyper-parameter configurations with far fewer agents than typically required by prior work. For more details about PB2 checkout Anyscale’s own document.

Bandit Based Search

PB2 was a bandit based search as well as population based. But what is “bandit” means in the context of probability theory ? The multi-armed bandit problem (sometimes called the N-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice’s properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice.

Long story short we can think hyper-parameter optimization task as an infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. HyperBand -a bandit based algorithm- aims to optimize with a method called successive halving. Hyperband implements hyper-parameter optimization by sampling candidates at random and “trying” them first, running them for a specific budget. The approach is iterative, promising candidates are run for a longer time, increasing the fidelity for their performance. While this is a very efficient racing approach, random sampling makes no use of the knowledge gained about the candidates during optimization. Therefore I would still prefer PBT over HyperBand.

Mean Reward Distribution on Steps for Different Learning Rates (Example for Hyper-parameter Optimization on a Reinforcement Learning Task)

Summary of Key Findings

We talked about what is the hyper-parameter optimization and most common HPO approaches from simplest to the newest approach. The key is to know all of these approaches and use the most suitable one for you. As model selection doesn’t have one answer, HPO is also depends on the task and the resources itself. If parallel search enough for you and you can try combinations of parameter values as much as you want then go with it. We also see the advantages and disadvantages of all the approaches. This is important since you will choose your approach by considering these facts.

I personally like PBT since I specialized on reinforcement learning and most of my time I use Population Based Training for the simulations however it does take time and also it can only be used with same neural network architectures. Therefore for trying new architectures I try basically parallel search methods.

I quite enjoyed my time while writing this post and reading papers about hyper-parameter optimization. Thanks for your patience and being all along with me throughout the post. Also any feedback is welcome on comments.You can also buy me a coffee!😎

References

Sequential Search Related

Comprehensive analysis of gradient-based hyper-parameter optimization algorithms — Comparison of gradient-based algorithms vs random search.
A blog post for detail explanation of Bayesian Optimization
A Hands-On Bayesian Optimization implementation in Python from Scratch
Conceptual Explanation of Bayesian Optimization
I would also recommend you to read this case study article of a Credit Card Fraud company which uses Bayesian Optimization on their models.

Population Based Related

Bandit Based Related

Hyperband: A Novel Bandit-Based Approach to Hyper-parameter Optimization