OCDocker.OCScore.DNN.future.DNNOptimizer module

Module to perform the optimization of the future Neural Network pipeline.

It is imported as:

from OCDocker.OCScore.DNN.future.DNNOptimizer import DNNOptimizer

class OCDocker.OCScore.DNN.future.DNNOptimizer.DNNOptimizer(X_train, y_train, X_test, y_test, X_validation=None, y_validation=None, mask=None, storage='sqlite:///NNoptimization.db', encoder_params=None, output_size=1, random_seed=42, use_gpu=True, verbose=False, future_config=None)[source]

Bases: object

Future DNN optimizer with multi-stage, multi-task training.

Parameters:
  • X_train (np.ndarray | pd.DataFrame) – Primary regression training features.

  • y_train (np.ndarray | pd.Series) – Primary regression training targets.

  • X_test (np.ndarray | pd.DataFrame) – Primary regression testing features.

  • y_test (np.ndarray | pd.Series) – Primary regression testing targets.

  • X_validation (np.ndarray | pd.DataFrame | None, optional) – Ranking/classification features (if provided). Default is None.

  • y_validation (np.ndarray | pd.Series | None, optional) – Ranking/classification labels (if provided). Default is None.

  • mask (list[int|bool] | np.ndarray, optional) – Feature mask for ablation/feature selection.

  • storage (str, optional) – Optuna storage string.

  • encoder_params (dict | None, optional) – Encoder params (old-style).

  • output_size (int, optional) – Output size (kept for compatibility). Default is 1.

  • random_seed (int, optional) – Random seed. Default is 42.

  • use_gpu (bool, optional) – Use GPU if available. Default is True.

  • verbose (bool, optional) – Verbose mode. Default is False.

  • future_config (dict | None, optional) – Configuration overrides for future pipeline.

Notes

The training flow is split into two stages:

  • stage1: regression + optional reconstruction (pretraining on continuous targets).

  • stage2: ranking/classification with optional energy/reconstruction regularization.

Data Flow

  • Regression data: (X_train, y_train) are used for stage1 energy regression. (X_test, y_test) are used as the stage1 validation split.

  • Ranking data: (X_validation, y_validation) are used to build stage2 train/val loaders (grouped by target ids). If these are missing, stage2 cannot run.

  • Target grouping: if X_validation is a DataFrame and contains the column given by data.ranking_target_column (default "receptor"), that column is used as target ids. Otherwise, future_config["ranking_targets"] can supply them. If still unavailable, all samples are treated as one target.

  • Custom splits: future_config may provide pre-split dictionaries “ranking_train_data” and “ranking_val_data” with keys {X, y, targets}.
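As a minimal sketch of the "Custom splits" convention above, the following builds pre-split ranking dictionaries with the {X, y, targets} keys. The array shapes and random values are purely illustrative, not a required layout beyond the three keys:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pre-split ranking data; "targets" holds the target
# (e.g. receptor) id of each sample so batches can be grouped by target.
ranking_train_data = {
    "X": rng.normal(size=(80, 20)),          # ranking features
    "y": rng.integers(0, 2, size=80),        # ranking/classification labels
    "targets": np.repeat(np.arange(8), 10),  # target ids, 10 samples each
}
ranking_val_data = {
    "X": rng.normal(size=(20, 20)),
    "y": rng.integers(0, 2, size=20),
    "targets": np.repeat(np.arange(8, 10), 10),
}

future_config = {
    "ranking_train_data": ranking_train_data,
    "ranking_val_data": ranking_val_data,
}
```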

Clarification

The reconstruction head in stage1 is only a regularizer. It does not replace the standalone Autoencoder pipeline and is not a dimensionality reduction step by itself. If you want explicit dimensionality reduction, run it upstream (e.g., PCA/AE) and pass the resulting embeddings as X, setting lambda_recon=0 to disable reconstruction.
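To illustrate the upstream-reduction route described above, here is a self-contained sketch of PCA via SVD (using NumPy only; the toy matrices stand in for the real feature data). The resulting embeddings would then be passed as X with lambda_recon disabled:

```python
import numpy as np

# Toy data standing in for the real feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
X_test = rng.normal(size=(25, 20))

# Simple PCA via SVD: center on the training mean, project onto the
# top-8 right singular vectors.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:8]

Z_train = (X_train - mean) @ components.T
Z_test = (X_test - mean) @ components.T

# Reconstruction is redundant on already-reduced inputs, so disable it.
future_config = {"stage1": {"lambda_recon": 0.0, "noise_type": "none"}}
```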

Configuration

The future_config dict is merged into the defaults using the keys below:

model
  • shared_sizes : list[int]

    Hidden sizes for the shared encoder (used when encoder_params is None).

  • shared_activation : str

    Activation for shared encoder layers.

  • decoder_sizes : list[int] | None

    Reconstruction decoder sizes; if None and recon loss enabled, a mirrored decoder is built automatically.

  • head_sizes : list[int]

    Hidden sizes for energy/activity heads.

  • embedding_dim : int | None

    Output size for embedding head (None disables).

  • dropout : float

    Dropout probability for encoder/heads.

  • batch_norm : bool

    Whether to use BatchNorm1d.

stage1
  • enabled : bool

    Whether to run stage1 training.

  • epochs : int

    Number of training epochs.

  • batch_size : int

    Batch size for regression data.

  • lr, weight_decay : float

    Optimizer hyperparameters.

  • lambda_recon, lambda_energy : float

    Weights for reconstruction and energy losses.

  • energy_loss : str

    Energy loss type ("mse" or "huber").

  • noise_type : str

    Input noise type ("mask", "gaussian", or "none").

  • mask_prob, gaussian_std : float

    Noise parameters.

  • clip_grad : float

    Gradient clipping max-norm (0 disables).

  • early_stopping_patience : int

    Stop after this many epochs without improvement.

stage2
  • enabled : bool

    Whether to run stage2 ranking/classification.

  • epochs : int

    Number of training epochs.

  • batch_size_per_target : int | None

    Optional per-target batch size for ranking.

  • split_target_batches : bool

    Whether to split large targets into multiple batches.

  • lr, weight_decay : float

    Optimizer hyperparameters.

  • lambda_rank, lambda_cls, lambda_con : float

    Weights for ranking, classification, and contrastive losses.

  • lambda_energy, lambda_recon : float

    Optional regularizers from regression data.

  • rank_k_fractions : Sequence[float]

    Top-k fractions for LambdaRank weighting and ranking metrics.

  • rank_weights : Sequence[float]

    Weights per k fraction (same length as rank_k_fractions).

  • temperature : float

    Temperature for contrastive loss.

  • clip_grad : float

    Gradient clipping max-norm (0 disables).

  • use_focal : bool

    Use focal loss instead of BCE for classification.

  • focal_alpha, focal_gamma : float

    Focal loss parameters.

  • bce_pos_weight : float | None

    Optional positive class weight for BCE.

  • early_stopping_patience : int

    Stop after this many epochs without improvement.

  • energy_batch_ratio : float

    Ratio of regression batches used in stage2 regularization.

optimization
  • loss_balancing : str

    "fixed" or "uncertainty" (learns task weights).

  • metric_for_best : str

    Metric key used to track the best validation model.

  • multi_objective : bool

    Whether to return a multi-objective tuple to Optuna.

  • objective_metric : str

    Metric key to optimize when not multi-objective.

data
  • ranking_validation_fraction : float

    Fraction of ranking data held out for validation.

  • ranking_split_by_target : bool

    If True, split by target ids (GroupShuffleSplit).

  • ranking_target_column : str

    Column name in X_validation (DataFrame) containing target ids.

Example

>>> trainer = DNNOptimizer(X_train, y_train, X_test, y_test, X_validation, y_validation)
>>> trainer.optimize(n_trials=5)
>>> # AE -> DNN pipeline with precomputed embeddings
>>> import torch
>>> from OCDocker.OCScore.Dimensionality.future.Autoencoder import Autoencoder
>>> ae = Autoencoder(input_size=20, encoder_hidden_sizes=[32, 16], latent_dim=8, energy_head_sizes=None)
>>> with torch.no_grad():
...     Z_train = ae.encode(torch.tensor(X_train, dtype=torch.float32)).cpu().numpy()
...     Z_test = ae.encode(torch.tensor(X_test, dtype=torch.float32)).cpu().numpy()
>>> dnn = DNNOptimizer.from_embeddings(
...     Z_train, y_train, Z_test, y_test,
...     future_config={"stage1": {"lambda_recon": 0.0, "noise_type": "none"}, "stage2": {"enabled": False}}
... )
>>> dnn.optimize(n_trials=1)
__init__(X_train, y_train, X_test, y_test, X_validation=None, y_validation=None, mask=None, storage='sqlite:///NNoptimization.db', encoder_params=None, output_size=1, random_seed=42, use_gpu=True, verbose=False, future_config=None)[source]

Initialize the future DNN optimizer.

Parameters:
  • X_train (np.ndarray | pd.DataFrame) – Primary regression training features.

  • y_train (np.ndarray | pd.Series) – Primary regression training targets.

  • X_test (np.ndarray | pd.DataFrame) – Primary regression testing features.

  • y_test (np.ndarray | pd.Series) – Primary regression testing targets.

  • X_validation (np.ndarray | pd.DataFrame | None, optional) – Ranking/classification features, by default None.

  • y_validation (np.ndarray | pd.Series | None, optional) – Ranking/classification labels, by default None.

  • mask (list[int | bool] | np.ndarray, optional) – Feature mask, by default None.

  • storage (str, optional) – Optuna storage string, by default “sqlite:///NNoptimization.db”.

  • encoder_params (dict | None, optional) – Legacy encoder parameters, by default None.

  • output_size (int, optional) – Output size (compat), by default 1.

  • random_seed (int, optional) – Random seed, by default 42.

  • use_gpu (bool, optional) – Use GPU if available, by default True.

  • verbose (bool, optional) – Verbose mode, by default False.

  • future_config (dict | None, optional) – Configuration overrides, by default None.

Return type:

None

mask: torch.Tensor | None
rank_train: Dict[str, Any] | None
rank_val: Dict[str, Any] | None
classmethod from_embeddings(X_embeddings_train, y_train, X_embeddings_test, y_test, X_embeddings_validation=None, y_validation=None, **kwargs)[source]

Construct a DNNOptimizer from precomputed embeddings.

Parameters:
  • X_embeddings_train (np.ndarray | pd.DataFrame) – Training embeddings (output of a dimensionality reducer).

  • y_train (np.ndarray | pd.Series) – Training regression targets.

  • X_embeddings_test (np.ndarray | pd.DataFrame) – Test embeddings.

  • y_test (np.ndarray | pd.Series) – Test regression targets.

  • X_embeddings_validation (np.ndarray | pd.DataFrame | None, optional) – Optional ranking/classification embeddings, by default None.

  • y_validation (np.ndarray | pd.Series | None, optional) – Optional ranking/classification labels, by default None.

  • **kwargs (Any) – Additional keyword arguments forwarded to DNNOptimizer.

Returns:

Configured optimizer instance.

Return type:

DNNOptimizer

Notes

When using embeddings, set stage1.lambda_recon=0 and noise_type="none" to avoid reconstructing already-reduced features.

objective(trial)[source]

Objective function for Optuna.

Parameters:

trial (optuna.Trial) – Optuna trial.

Returns:

Optimization objective value(s).

Return type:

float | tuple

optimize(direction='maximize', n_trials=10, study_name='NN_Future_Optimization', load_if_exists=True, sampler=optuna.samplers.TPESampler, n_jobs=1)[source]

Optimize the future pipeline using Optuna.

Parameters:
  • direction (str, optional) – Direction of optimization (ignored if multi-objective). Default is “maximize”.

  • n_trials (int, optional) – Number of trials. Default is 10.

  • study_name (str, optional) – Study name. Default is “NN_Future_Optimization”.

  • load_if_exists (bool, optional) – Load existing study. Default True.

  • sampler (optuna.samplers.BaseSampler, optional) – Optuna sampler. Default TPESampler().

  • n_jobs (int, optional) – Number of parallel jobs. Default 1.

Returns:

Optuna study object.

Return type:

optuna.study.Study

set_random_seed()[source]

Set the random seed for reproducibility.

Return type:

None