OCDocker.OCScore.DNN.future.DNNOptimizer module¶
Module to perform the optimization of the future Neural Network pipeline.
It is imported as:
from OCDocker.OCScore.DNN.future.DNNOptimizer import DNNOptimizer
- class OCDocker.OCScore.DNN.future.DNNOptimizer.DNNOptimizer(X_train, y_train, X_test, y_test, X_validation=None, y_validation=None, mask=None, storage='sqlite:///NNoptimization.db', encoder_params=None, output_size=1, random_seed=42, use_gpu=True, verbose=False, future_config=None)[source]¶
Bases: object
Future DNN optimizer with multi-stage, multi-task training.
- Parameters:
X_train (np.ndarray | pd.DataFrame) – Primary regression training features.
y_train (np.ndarray | pd.Series) – Primary regression training targets.
X_test (np.ndarray | pd.DataFrame) – Primary regression testing features.
y_test (np.ndarray | pd.Series) – Primary regression testing targets.
X_validation (np.ndarray | pd.DataFrame | None, optional) – Ranking/classification features (if provided). Default is None.
y_validation (np.ndarray | pd.Series | None, optional) – Ranking/classification labels (if provided). Default is None.
mask (list[int|bool] | np.ndarray, optional) – Feature mask for ablation/feature selection.
storage (str, optional) – Optuna storage string.
encoder_params (dict | None, optional) – Encoder params (old-style).
output_size (int, optional) – Output size (kept for compatibility). Default is 1.
random_seed (int, optional) – Random seed. Default is 42.
use_gpu (bool, optional) – Use GPU if available. Default is True.
verbose (bool, optional) – Verbose mode. Default is False.
future_config (dict | None, optional) – Configuration overrides for future pipeline.
Notes
The training flow is split into two stages:
- stage1: regression + optional reconstruction (pretraining on continuous targets).
- stage2: ranking/classification with optional energy/reconstruction regularization.
Data Flow¶
Regression data: (X_train, y_train) are used for stage1 energy regression. (X_test, y_test) are used as the stage1 validation split.
Ranking data: (X_validation, y_validation) are used to build stage2 train/val loaders (grouped by target ids). If these are missing, stage2 cannot run.
Target grouping: if X_validation is a DataFrame and contains the column given by data.ranking_target_column (default “receptor”), that column is used as target ids. Otherwise, future_config[“ranking_targets”] can supply them. If still unavailable, all samples are treated as one target.
Custom splits: future_config may provide pre-split dictionaries “ranking_train_data” and “ranking_val_data” with keys {X, y, targets}.
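The pre-split dictionaries described above can be sketched as plain mappings carrying the documented {X, y, targets} keys. This is a minimal NumPy sketch; the shapes and the target-id layout are illustrative assumptions, not values taken from the library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pre-split ranking data; keys follow the documented
# {X, y, targets} contract. Shapes here are illustrative only.
ranking_train_data = {
    "X": rng.normal(size=(80, 20)),          # ranking features
    "y": rng.normal(size=80),                # ranking labels/scores
    "targets": np.repeat(np.arange(8), 10),  # target (receptor) ids
}
ranking_val_data = {
    "X": rng.normal(size=(20, 20)),
    "y": rng.normal(size=20),
    "targets": np.repeat(np.arange(8, 10), 10),
}

# The dictionaries are then supplied through future_config.
future_config = {
    "ranking_train_data": ranking_train_data,
    "ranking_val_data": ranking_val_data,
}
```

Grouping validation targets disjointly from training targets (ids 8-9 vs. 0-7 above) mirrors a by-target split, so validation measures generalization to unseen receptors.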
Clarification¶
The reconstruction head in stage1 is only a regularizer. It does not replace the standalone Autoencoder pipeline and is not a dimensionality reduction step by itself. If you want explicit dimensionality reduction, run it upstream (e.g., PCA/AE) and pass the resulting embeddings as X, setting lambda_recon=0 to disable reconstruction.
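As an illustration of the upstream-reduction route, here is a minimal PCA-via-SVD sketch in plain NumPy (standing in for any reducer, e.g. sklearn's PCA or the Autoencoder pipeline); the resulting embeddings are what would be passed as X, with reconstruction disabled:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
X_test = rng.normal(size=(25, 20))

# PCA via SVD: center on the training mean, then project both splits
# onto the top-k right singular vectors fitted on the training data.
k = 8
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
Z_train = (X_train - mu) @ Vt[:k].T
Z_test = (X_test - mu) @ Vt[:k].T

# Embeddings are already reduced, so reconstruction and input noise
# are switched off when handing them to the optimizer.
future_config = {"stage1": {"lambda_recon": 0.0, "noise_type": "none"}}
```

Note that the test split is centered with the training mean, not its own, so the projection is fitted purely on training data.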
Configuration¶
The future_config dict is merged into the defaults using the keys below:
- model
- shared_sizes : list[int]
Hidden sizes for the shared encoder (used when encoder_params is None).
- shared_activation : str
Activation for shared encoder layers.
- decoder_sizes : list[int] | None
Reconstruction decoder sizes; if None and recon loss enabled, a mirrored decoder is built automatically.
- head_sizes : list[int]
Hidden sizes for energy/activity heads.
- embedding_dim : int | None
Output size for embedding head (None disables).
- dropout : float
Dropout probability for encoder/heads.
- batch_norm : bool
Whether to use BatchNorm1d.
- stage1
- enabled : bool
Whether to run stage1 training.
- epochs : int
Number of training epochs.
- batch_size : int
Batch size for regression data.
- lr, weight_decay : float
Optimizer hyperparameters.
- lambda_recon, lambda_energy : float
Weights for reconstruction and energy losses.
- energy_loss : str
Energy loss type (“mse” or “huber”).
- noise_type : str
Input noise type (“mask”, “gaussian”, or “none”).
- mask_prob, gaussian_std : float
Noise parameters.
- clip_grad : float
Gradient clipping max-norm (0 disables).
- early_stopping_patience : int
Stop after this many epochs without improvement.
- stage2
- enabled : bool
Whether to run stage2 ranking/classification.
- epochs : int
Number of training epochs.
- batch_size_per_target : int | None
Optional per-target batch size for ranking.
- split_target_batches : bool
Whether to split large targets into multiple batches.
- lr, weight_decay : float
Optimizer hyperparameters.
- lambda_rank, lambda_cls, lambda_con : float
Weights for ranking, classification, and contrastive losses.
- lambda_energy, lambda_recon : float
Optional regularizers from regression data.
- rank_k_fractions : Sequence[float]
Top-k fractions for LambdaRank weighting and ranking metrics.
- rank_weights : Sequence[float]
Weights per k fraction (same length as rank_k_fractions).
- temperature : float
Temperature for contrastive loss.
- clip_grad : float
Gradient clipping max-norm (0 disables).
- use_focal : bool
Use focal loss instead of BCE for classification.
- focal_alpha, focal_gamma : float
Focal loss parameters.
- bce_pos_weight : float | None
Optional positive class weight for BCE.
- early_stopping_patience : int
Stop after this many epochs without improvement.
- energy_batch_ratio : float
Ratio of regression batches used in stage2 regularization.
- optimization
- loss_balancing : str
“fixed” or “uncertainty” (learns task weights).
- metric_for_best : str
Metric key used to track best validation model.
- multi_objective : bool
Whether to return a multi-objective tuple to Optuna.
- objective_metric : str
Metric key to optimize when not multi-objective.
- data
- ranking_validation_fraction : float
Fraction of ranking data held out for validation.
- ranking_split_by_target : bool
If True, split by target ids (GroupShuffleSplit).
- ranking_target_column : str
Column name in X_validation (DataFrame) containing target ids.
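Putting the keys above together, a partial future_config override might look like the sketch below. Only the nested keys that are present are overridden; everything else falls back to the defaults. The specific metric names (e.g. "val_loss") and numeric values are illustrative assumptions, not documented defaults.

```python
# Hypothetical overrides for a subset of the documented keys; any key
# left out keeps its default value after the merge.
future_config = {
    "model": {"shared_sizes": [256, 128], "dropout": 0.2, "batch_norm": True},
    "stage1": {"epochs": 50, "lambda_recon": 0.5, "noise_type": "mask",
               "mask_prob": 0.1, "early_stopping_patience": 10},
    "stage2": {"epochs": 30, "lambda_rank": 1.0, "lambda_cls": 0.5,
               # one weight per top-k fraction (lengths must match)
               "rank_k_fractions": [0.01, 0.05, 0.1],
               "rank_weights": [0.5, 0.3, 0.2]},
    "optimization": {"loss_balancing": "uncertainty",
                     "metric_for_best": "val_loss"},  # assumed metric key
    "data": {"ranking_validation_fraction": 0.2,
             "ranking_split_by_target": True},
}
```

The rank_k_fractions/rank_weights pairing is the one length constraint worth checking up front, since the documentation requires the two sequences to be the same length.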
Example
>>> trainer = DNNOptimizer(X_train, y_train, X_test, y_test, X_validation, y_validation)
>>> trainer.optimize(n_trials=5)
>>> # AE -> DNN pipeline with precomputed embeddings
>>> import torch
>>> from OCDocker.OCScore.Dimensionality.future.Autoencoder import Autoencoder
>>> ae = Autoencoder(input_size=20, encoder_hidden_sizes=[32, 16], latent_dim=8, energy_head_sizes=None)
>>> with torch.no_grad():
...     Z_train = ae.encode(torch.tensor(X_train, dtype=torch.float32)).cpu().numpy()
...     Z_test = ae.encode(torch.tensor(X_test, dtype=torch.float32)).cpu().numpy()
>>> dnn = DNNOptimizer.from_embeddings(
...     Z_train, y_train, Z_test, y_test,
...     future_config={"stage1": {"lambda_recon": 0.0, "noise_type": "none"}, "stage2": {"enabled": False}}
... )
>>> dnn.optimize(n_trials=1)
- __init__(X_train, y_train, X_test, y_test, X_validation=None, y_validation=None, mask=None, storage='sqlite:///NNoptimization.db', encoder_params=None, output_size=1, random_seed=42, use_gpu=True, verbose=False, future_config=None)[source]¶
Initialize the future DNN optimizer.
- Parameters:
X_train (np.ndarray | pd.DataFrame) – Primary regression training features.
y_train (np.ndarray | pd.Series) – Primary regression training targets.
X_test (np.ndarray | pd.DataFrame) – Primary regression testing features.
y_test (np.ndarray | pd.Series) – Primary regression testing targets.
X_validation (np.ndarray | pd.DataFrame | None, optional) – Ranking/classification features, by default None.
y_validation (np.ndarray | pd.Series | None, optional) – Ranking/classification labels, by default None.
mask (list[int | bool] | np.ndarray, optional) – Feature mask, by default None.
storage (str, optional) – Optuna storage string, by default “sqlite:///NNoptimization.db”.
encoder_params (dict | None, optional) – Legacy encoder parameters, by default None.
output_size (int, optional) – Output size (compat), by default 1.
random_seed (int, optional) – Random seed, by default 42.
use_gpu (bool, optional) – Use GPU if available, by default True.
verbose (bool, optional) – Verbose mode, by default False.
future_config (dict | None, optional) – Configuration overrides, by default None.
- Return type:
None
- mask: torch.Tensor | None¶
- rank_train: Dict[str, Any] | None¶
- rank_val: Dict[str, Any] | None¶
- classmethod from_embeddings(X_embeddings_train, y_train, X_embeddings_test, y_test, X_embeddings_validation=None, y_validation=None, **kwargs)[source]¶
Construct a DNNOptimizer from precomputed embeddings.
- Parameters:
X_embeddings_train (np.ndarray | pd.DataFrame) – Training embeddings (output of a dimensionality reducer).
y_train (np.ndarray | pd.Series) – Training regression targets.
X_embeddings_test (np.ndarray | pd.DataFrame) – Test embeddings.
y_test (np.ndarray | pd.Series) – Test regression targets.
X_embeddings_validation (np.ndarray | pd.DataFrame | None, optional) – Optional ranking/classification embeddings, by default None.
y_validation (np.ndarray | pd.Series | None, optional) – Optional ranking/classification labels, by default None.
**kwargs (Any) – Additional keyword arguments forwarded to DNNOptimizer.
- Returns:
Configured optimizer instance.
- Return type:
DNNOptimizer
Notes
When using embeddings, set stage1.lambda_recon=0 and noise_type=”none” to avoid reconstructing already-reduced features.
- objective(trial)[source]¶
Objective function for Optuna.
- Parameters:
trial (optuna.Trial) – Optuna trial.
- Returns:
Optimization objective value(s).
- Return type:
float | tuple
- optimize(direction='maximize', n_trials=10, study_name='NN_Future_Optimization', load_if_exists=True, sampler=optuna.samplers.TPESampler, n_jobs=1)[source]¶
Optimize the future pipeline using Optuna.
- Parameters:
direction (str, optional) – Direction of optimization (ignored if multi-objective). Default is “maximize”.
n_trials (int, optional) – Number of trials. Default is 10.
study_name (str, optional) – Study name. Default is “NN_Future_Optimization”.
load_if_exists (bool, optional) – Load existing study. Default True.
sampler (optuna.samplers.BaseSampler, optional) – Optuna sampler. Default TPESampler().
n_jobs (int, optional) – Number of parallel jobs. Default 1.
- Returns:
Optuna study object.
- Return type:
optuna.study.Study