OCDocker.OCScore.Optimization.DNN module

Module with helpers to optimize the Neural Network model parameters using Optuna.

It is imported as:

import OCDocker.OCScore.Optimization.DNN as ocdnn

OCDocker.OCScore.Optimization.DNN.optimize_NN(df_path, storage_id, base_models_folder, data=None, storage='sqlite:///NN_optimization.db', use_pdb_train=True, no_scores=False, only_scores=False, use_PCA=False, best_ao_params=None, pca_type=80, pca_model='', encoder_dims=(16, 256), autoencoder=True, multiencoder=False, run_autoencoder_optimization=True, num_processes_autoencoder=8, total_trials_autoencoder=2000, run_NN_optimization=True, num_processes_NN=8, total_trials_NN=125, explained_variance=0.95, random_seed=42, load_if_exists=True, use_gpu=True, parallel_backend='joblib', verbose=False)[source]

Optimize the Neural Network using the given parameters.

Parameters:
  • df_path (str) – The path to the DataFrame.

  • storage_id (int) – The storage ID to use.

  • base_models_folder (str) – The base models folder to use.

  • data (dict, optional) – The data dictionary. Default is None (treated as empty dict). If not empty, the data dictionary will be used instead of loading the data. This is useful for multiprocessing to avoid loading the data multiple times.

  • storage (str, optional) – The storage to use. Default is “sqlite:///NN_optimization.db”.

  • use_pdb_train (bool, optional) – If True, use the PDBbind data for training. If False, use the DUDEz data for training. Default is True.

  • no_scores (bool, optional) – If True, exclude the scoring functions from training; this overrides only_scores. If False, use the scoring functions. Default is False.

  • only_scores (bool, optional) – If True, only use the scoring functions for training. If False, use all the features. Default is False.

  • use_PCA (bool, optional) – If True, use PCA to reduce the number of features. If False, use all the features. Default is False.

  • best_ao_params (dict, optional) – The best autoencoder parameters. Default is None.

  • pca_type (int, optional) – The PCA type to use. Default is 80.

  • pca_model (Union[str, PCA], optional) – The PCA model to use. Default is “”.

  • autoencoder (bool, optional) – If True, use the autoencoder. If False, don’t use the autoencoder. Default is True.

  • multiencoder (bool, optional) – If True, use the multiencoder. If False, don’t use the multiencoder. Default is False.

  • run_autoencoder_optimization (bool, optional) – If True, run the autoencoder optimization. If False, don’t run the autoencoder optimization. Default is True.

  • num_processes_autoencoder (int, optional) – The number of processes to use for the autoencoder. Default is 8.

  • total_trials_autoencoder (int, optional) – The number of total trials to use for the autoencoder. Default is 2000.

  • run_NN_optimization (bool, optional) – If True, run the Neural Network optimization. If False, don’t run the Neural Network optimization. Default is True.

  • num_processes_NN (int, optional) – The number of processes to use for the Neural Network. Default is 8.

  • total_trials_NN (int, optional) – The total number of trials to use for the Neural Network. Default is 125.

  • explained_variance (float, optional) – The explained variance to use. Default is 0.95.

  • random_seed (int, optional) – The random seed to use. Default is 42.

  • load_if_exists (bool, optional) – If True, load the model if it exists. If False, don’t load the model if it exists. Default is True.

  • use_gpu (bool, optional) – If True, use the GPU. If False, don’t use the GPU. Default is True.

  • parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”. Warning: “multiprocessing” exhibited serious bugs during testing of this library; “joblib” is strongly recommended.

  • verbose (bool, optional) – If True, print the output. If False, don’t print the output. Default is False.

  • encoder_dims (tuple[int, int], optional) – The encoder dimensions to use. Default is (16, 256).

Raises:

ValueError – If the parallel backend is not “joblib” or “multiprocessing”.

Return type:

None
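As the flag descriptions above note, no_scores takes precedence over only_scores. A minimal sketch of that precedence (select_features and the column names below are illustrative, not part of OCDocker):

```python
def select_features(columns, score_columns, no_scores=False, only_scores=False):
    """Illustrate the documented precedence: no_scores overrides only_scores."""
    if no_scores:
        # Drop every scoring-function column, even if only_scores is also True.
        return [c for c in columns if c not in score_columns]
    if only_scores:
        # Keep only the scoring-function columns.
        return [c for c in columns if c in score_columns]
    # Neither flag set: use all the features.
    return list(columns)

cols = ["vina_score", "smina_score", "mw", "logp"]
scores = {"vina_score", "smina_score"}

# Both flags True: no_scores wins, so the scoring columns are dropped.
select_features(cols, scores, no_scores=True, only_scores=True)  # ["mw", "logp"]
```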

OCDocker.OCScore.Optimization.DNN.perform_ablation_study_NN(X_train, y_train, X_test, y_test, X_val, y_val, id, num_processes, encoder_params, best_params, random_seed, use_gpu, verbose, load_if_exists, study_name, storage, masks=None, output_size=1, parallel_backend='joblib', n_jobs=1)[source]

Perform the ablation study for the Neural Network.

Parameters:
  • X_train (pd.DataFrame) – The training data.

  • y_train (pd.Series) – The training labels.

  • X_test (pd.DataFrame) – The testing data.

  • y_test (pd.Series) – The testing labels.

  • X_val (pd.DataFrame) – The validation data.

  • y_val (pd.Series) – The validation labels.

  • id (int) – The ID of the study.

  • num_processes (int) – The number of processes to use.

  • encoder_params (dict) – The encoder parameters.

  • best_params (dict) – The best parameters.

  • random_seed (int) – The random seed.

  • use_gpu (bool) – If True, use the GPU.

  • verbose (bool) – If True, print the output.

  • load_if_exists (bool) – If True, load the model if it exists.

  • study_name (str) – The study name.

  • storage (str) – The storage to use.

  • masks (list, optional) – The list of masks to apply. If None or empty, all masks for the scoring functions are generated and used. This is useful for splitting the ablation across multiple machines. Default is None (treated as an empty list).

  • output_size (int, optional) – The output size. Default is 1.

  • parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”. Warning: “multiprocessing” exhibited serious bugs during testing of this library; “joblib” is strongly recommended.

  • n_jobs (int, optional) – The number of jobs to use. Default is 1.

Raises:

ValueError – If the parallel backend is not “joblib” or “multiprocessing”.

Return type:

None
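The masks parameter above says that when it is empty, all masks for the scoring functions are generated. The exact mask scheme is defined inside the library; as an illustration only of how an explicit mask list could be built and split across machines, here is a hypothetical leave-one-out scheme:

```python
import numpy as np

def leave_one_out_masks(n_scores):
    """Hypothetical mask generator (not an OCDocker helper): one boolean
    mask per scoring function, each disabling a single column."""
    masks = []
    for i in range(n_scores):
        mask = np.ones(n_scores, dtype=bool)
        mask[i] = False  # ablate the i-th scoring function
        masks.append(mask)
    return masks

all_masks = leave_one_out_masks(4)
# Split the work between two machines, as the masks parameter allows:
machine_a, machine_b = all_masks[:2], all_masks[2:]
```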

OCDocker.OCScore.Optimization.DNN.perform_seed_ablation_study_NN(X_train, y_train, X_test, y_test, X_val, y_val, id, num_processes, encoder_params, best_params, use_gpu, verbose, load_if_exists, study_name, storage, mask, seeds=None, output_size=1, parallel_backend='joblib', n_jobs=1)[source]

Perform the seed ablation study for the Neural Network.

Parameters:
  • X_train (np.ndarray) – The training data.

  • y_train (np.ndarray) – The training labels.

  • X_test (np.ndarray) – The testing data.

  • y_test (np.ndarray) – The testing labels.

  • X_val (np.ndarray) – The validation data.

  • y_val (np.ndarray) – The validation labels.

  • id (int) – The ID of the study.

  • num_processes (int) – The number of processes to use.

  • encoder_params (dict) – The encoder parameters.

  • best_params (dict) – The best parameters.

  • use_gpu (bool) – If True, use the GPU.

  • verbose (bool) – If True, print the output.

  • load_if_exists (bool) – If True, load the model if it exists.

  • study_name (str) – The study name.

  • storage (str) – The storage to use.

  • mask (np.ndarray) – The mask to be applied.

  • seeds (list, optional) – The list of seeds to apply. If None or empty, all seeds from 0 to 1000 are used. This is useful for splitting the ablation across multiple machines. Default is None (treated as an empty list).

  • output_size (int, optional) – The output size. Default is 1.

  • parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”. Warning: “multiprocessing” exhibited serious bugs during testing of this library; “joblib” is strongly recommended.

  • n_jobs (int, optional) – The number of jobs to use. Default is 1.

Raises:

ValueError – If the parallel backend is not “joblib” or “multiprocessing”.

Return type:

None
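Per the seeds description, omitting the argument runs every seed from 0 to 1000, while passing an explicit subset lets you split the seed ablation across machines. A small sketch of such a split (split_seeds is illustrative, not an OCDocker helper):

```python
def split_seeds(seeds, n_machines):
    """Round-robin partition of a seed list across machines."""
    return [seeds[i::n_machines] for i in range(n_machines)]

all_seeds = list(range(1001))  # the documented default range, 0..1000
chunks = split_seeds(all_seeds, 4)
# Each chunk would then be passed as the seeds argument on one machine.
```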

OCDocker.OCScore.Optimization.DNN.optimize(df_path, storage_id, base_models_folder, data=None, storage='sqlite:///NN_optimization.db', use_pdb_train=True, no_scores=False, only_scores=False, use_PCA=False, best_ao_params=None, pca_type=80, pca_model='', encoder_dims=(16, 256), autoencoder=True, multiencoder=False, run_autoencoder_optimization=True, num_processes_autoencoder=8, total_trials_autoencoder=2000, run_NN_optimization=True, num_processes_NN=8, total_trials_NN=125, explained_variance=0.95, random_seed=42, load_if_exists=True, use_gpu=True, parallel_backend='joblib', verbose=False)

Optimize the Neural Network using the given parameters.

Parameters:
  • df_path (str) – The path to the DataFrame.

  • storage_id (int) – The storage ID to use.

  • base_models_folder (str) – The base models folder to use.

  • data (dict, optional) – The data dictionary. Default is None (treated as empty dict). If not empty, the data dictionary will be used instead of loading the data. This is useful for multiprocessing to avoid loading the data multiple times.

  • storage (str, optional) – The storage to use. Default is “sqlite:///NN_optimization.db”.

  • use_pdb_train (bool, optional) – If True, use the PDBbind data for training. If False, use the DUDEz data for training. Default is True.

  • no_scores (bool, optional) – If True, exclude the scoring functions from training; this overrides only_scores. If False, use the scoring functions. Default is False.

  • only_scores (bool, optional) – If True, only use the scoring functions for training. If False, use all the features. Default is False.

  • use_PCA (bool, optional) – If True, use PCA to reduce the number of features. If False, use all the features. Default is False.

  • best_ao_params (dict, optional) – The best autoencoder parameters. Default is None.

  • pca_type (int, optional) – The PCA type to use. Default is 80.

  • pca_model (Union[str, PCA], optional) – The PCA model to use. Default is “”.

  • autoencoder (bool, optional) – If True, use the autoencoder. If False, don’t use the autoencoder. Default is True.

  • multiencoder (bool, optional) – If True, use the multiencoder. If False, don’t use the multiencoder. Default is False.

  • run_autoencoder_optimization (bool, optional) – If True, run the autoencoder optimization. If False, don’t run the autoencoder optimization. Default is True.

  • num_processes_autoencoder (int, optional) – The number of processes to use for the autoencoder. Default is 8.

  • total_trials_autoencoder (int, optional) – The number of total trials to use for the autoencoder. Default is 2000.

  • run_NN_optimization (bool, optional) – If True, run the Neural Network optimization. If False, don’t run the Neural Network optimization. Default is True.

  • num_processes_NN (int, optional) – The number of processes to use for the Neural Network. Default is 8.

  • total_trials_NN (int, optional) – The total number of trials to use for the Neural Network. Default is 125.

  • explained_variance (float, optional) – The explained variance to use. Default is 0.95.

  • random_seed (int, optional) – The random seed to use. Default is 42.

  • load_if_exists (bool, optional) – If True, load the model if it exists. If False, don’t load the model if it exists. Default is True.

  • use_gpu (bool, optional) – If True, use the GPU. If False, don’t use the GPU. Default is True.

  • parallel_backend (str, optional) – The parallel backend to use. Options are “joblib” and “multiprocessing”. Default is “joblib”. Warning: “multiprocessing” exhibited serious bugs during testing of this library; “joblib” is strongly recommended.

  • verbose (bool, optional) – If True, print the output. If False, don’t print the output. Default is False.

  • encoder_dims (tuple[int, int], optional) – The encoder dimensions to use. Default is (16, 256).

Raises:

ValueError – If the parallel backend is not “joblib” or “multiprocessing”.

Return type:

None