OCDocker.OCScore.Optimization.XGBoost module

Module with a helper that optimizes the hyperparameters of the Extreme Gradient Boosting (XGBoost) model using Optuna.

It is imported as:

import OCDocker.OCScore.Optimization.XGBoost as ocxgb

OCDocker.OCScore.Optimization.XGBoost.optimize_XGB(df_path, storage_id, base_models_folder, data=None, storage='sqlite:///XGB_optimization.db', use_pdb_train=True, no_scores=False, only_scores=False, use_PCA=False, pca_type=95, pca_model='', run_pre_XGB_optimization=True, num_processes_pre_XGB=8, total_trials_pre_XGB=250, run_GA_optimization=False, num_processes_GA=8, total_trials_GA=10, run_XGB_optimization=True, num_processes_XGB=8, total_trials_XGB=10, early_stopping_rounds=20, random_seed=42, load_if_exists=True, use_gpu=True, parallel_backend='joblib', verbose=False)

Optimize the XGBoost model using the given parameters.

Parameters:

df_path (str) – The path to the DataFrame.
storage_id (int) – The storage ID to use.
base_models_folder (str) – The base models folder to use.
data (dict, optional) – The data dictionary. Default is None (treated as empty dict). If not empty, the data dictionary will be used instead of loading the data. This is useful for multiprocessing to avoid loading the data multiple times.
storage (str, optional) – The storage to use. Default is “sqlite:///XGB_optimization.db”.
use_pdb_train (bool, optional) – If True, use the PDBbind data for training. If False, use the DUDEz data for training. Default is True.
no_scores (bool, optional) – If True, don’t use the scoring functions for training. If False, use the scoring functions. Default is False. (Will override only_scores if True)
only_scores (bool, optional) – If True, only use the scoring functions for training. If False, use all the features. Default is False.
use_PCA (bool, optional) – If True, use PCA to reduce the number of features. If False, use all the features. Default is False.
pca_type (int, optional) – The PCA type to use. Default is 95.
pca_model (Union[str, PCA], optional) – The PCA model to use. Default is “”.
run_pre_XGB_optimization (bool, optional) – If True, run the pre-XGBoost optimization. If False, don’t run the pre-XGBoost optimization. Default is True.
num_processes_pre_XGB (int, optional) – The number of processes to use for the pre-XGBoost optimization. Default is 8.
total_trials_pre_XGB (int, optional) – The total number of trials to use for the pre-XGBoost optimization. Default is 250.
run_GA_optimization (bool, optional) – If True, run the Genetic Algorithm optimization. If False, don’t run the Genetic Algorithm optimization. Default is False.
num_processes_GA (int, optional) – The number of processes to use for the Genetic Algorithm optimization. Default is 8.
total_trials_GA (int, optional) – The total number of trials to use for the Genetic Algorithm optimization. Default is 10.
run_XGB_optimization (bool, optional) – If True, run the XGBoost optimization. If False, don’t run the XGBoost optimization. Default is True.
num_processes_XGB (int, optional) – The number of processes to use for the XGBoost optimization. Default is 8.
total_trials_XGB (int, optional) – The total number of trials to use for the XGBoost optimization. Default is 10.
early_stopping_rounds (int, optional) – The number of early stopping rounds to use. Default is 20.
random_seed (int, optional) – The random seed to use. Default is 42.
load_if_exists (bool, optional) – If True, load the model if it exists. If False, don’t load the model if it exists. Default is True.
use_gpu (bool, optional) – If True, use the GPU. If False, don’t use the GPU. Default is True.
parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”.
verbose (bool, optional) – If True, print out more information. If False, print out less information. Default is False.

Return type:

None
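
A call to this function could look like the minimal sketch below. The file path, folder name and storage_id value are hypothetical placeholders, and the run_optimization wrapper is an illustrative helper that is not part of OCDocker; only the keyword names and defaults are taken from the signature above. The import is deferred into the wrapper so that the snippet can be read and loaded without OCDocker installed.

```python
# Keyword arguments mirroring the documented signature.
# The path, folder and storage_id values are hypothetical placeholders.
xgb_kwargs = dict(
    df_path="data/ocdocker_features.csv",   # hypothetical path to the DataFrame
    storage_id=1,                            # hypothetical storage ID
    base_models_folder="models/base",        # hypothetical folder
    storage="sqlite:///XGB_optimization.db", # documented default
    use_pdb_train=True,
    no_scores=False,
    only_scores=False,
    use_PCA=False,
    run_pre_XGB_optimization=True,
    total_trials_pre_XGB=250,
    run_GA_optimization=False,
    run_XGB_optimization=True,
    total_trials_XGB=10,
    early_stopping_rounds=20,
    random_seed=42,
    use_gpu=False,                           # set to True if a GPU is available
    parallel_backend="joblib",
    verbose=True,
)


def run_optimization(kwargs):
    """Illustrative wrapper (not part of OCDocker): import lazily, then call."""
    import OCDocker.OCScore.Optimization.XGBoost as ocxgb

    # The function is documented as returning None.
    ocxgb.optimize_XGB(**kwargs)
```

Calling run_optimization(xgb_kwargs) would then start the optimization, assuming OCDocker and its dependencies are installed and the placeholder paths point to real data.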

OCDocker.OCScore.Optimization.XGBoost.optimize(df_path, storage_id, base_models_folder, data=None, storage='sqlite:///XGB_optimization.db', use_pdb_train=True, no_scores=False, only_scores=False, use_PCA=False, pca_type=95, pca_model='', run_pre_XGB_optimization=True, num_processes_pre_XGB=8, total_trials_pre_XGB=250, run_GA_optimization=False, num_processes_GA=8, total_trials_GA=10, run_XGB_optimization=True, num_processes_XGB=8, total_trials_XGB=10, early_stopping_rounds=20, random_seed=42, load_if_exists=True, use_gpu=True, parallel_backend='joblib', verbose=False)

Optimize the XGBoost model using the given parameters.

Parameters:

df_path (str) – The path to the DataFrame.
storage_id (int) – The storage ID to use.
base_models_folder (str) – The base models folder to use.
data (dict, optional) – The data dictionary. Default is None (treated as empty dict). If not empty, the data dictionary will be used instead of loading the data. This is useful for multiprocessing to avoid loading the data multiple times.
storage (str, optional) – The storage to use. Default is “sqlite:///XGB_optimization.db”.
use_pdb_train (bool, optional) – If True, use the PDBbind data for training. If False, use the DUDEz data for training. Default is True.
no_scores (bool, optional) – If True, don’t use the scoring functions for training. If False, use the scoring functions. Default is False. (Will override only_scores if True)
only_scores (bool, optional) – If True, only use the scoring functions for training. If False, use all the features. Default is False.
use_PCA (bool, optional) – If True, use PCA to reduce the number of features. If False, use all the features. Default is False.
pca_type (int, optional) – The PCA type to use. Default is 95.
pca_model (Union[str, PCA], optional) – The PCA model to use. Default is “”.
run_pre_XGB_optimization (bool, optional) – If True, run the pre-XGBoost optimization. If False, don’t run the pre-XGBoost optimization. Default is True.
num_processes_pre_XGB (int, optional) – The number of processes to use for the pre-XGBoost optimization. Default is 8.
total_trials_pre_XGB (int, optional) – The total number of trials to use for the pre-XGBoost optimization. Default is 250.
run_GA_optimization (bool, optional) – If True, run the Genetic Algorithm optimization. If False, don’t run the Genetic Algorithm optimization. Default is False.
num_processes_GA (int, optional) – The number of processes to use for the Genetic Algorithm optimization. Default is 8.
total_trials_GA (int, optional) – The total number of trials to use for the Genetic Algorithm optimization. Default is 10.
run_XGB_optimization (bool, optional) – If True, run the XGBoost optimization. If False, don’t run the XGBoost optimization. Default is True.
num_processes_XGB (int, optional) – The number of processes to use for the XGBoost optimization. Default is 8.
total_trials_XGB (int, optional) – The total number of trials to use for the XGBoost optimization. Default is 10.
early_stopping_rounds (int, optional) – The number of early stopping rounds to use. Default is 20.
random_seed (int, optional) – The random seed to use. Default is 42.
load_if_exists (bool, optional) – If True, load the model if it exists. If False, don’t load the model if it exists. Default is True.
use_gpu (bool, optional) – If True, use the GPU. If False, don’t use the GPU. Default is True.
parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”.
verbose (bool, optional) – If True, print out more information. If False, print out less information. Default is False.

Return type:

None
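
The parameter list states that no_scores overrides only_scores when both are True. The sketch below illustrates that precedence only. The function name resolve_feature_mode and the returned labels are hypothetical and do not come from OCDocker; the library's actual feature selection may be implemented differently.

```python
def resolve_feature_mode(no_scores: bool = False, only_scores: bool = False) -> str:
    """Hypothetical helper showing the documented flag precedence."""
    if no_scores:
        # no_scores wins, even if only_scores is also True.
        return "no_scores"
    if only_scores:
        return "only_scores"
    # Neither flag set: train on every available feature.
    return "all_features"


# The conflicting combination resolves in favor of no_scores.
mode = resolve_feature_mode(no_scores=True, only_scores=True)
```

With the documented defaults (both flags False), this sketch selects all features.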