OCDocker.OCScore.DNN.future.datasets module

Datasets and samplers for the future DNN pipeline.

class OCDocker.OCScore.DNN.future.datasets.EnergyDataset(*args, **kwargs)[source]

Bases: Dataset

Dataset for regression targets (energy labels).

Parameters:
  • features (np.ndarray) – Input features.

  • energies (np.ndarray) – Regression targets (e.g., energies).

  • mask (np.ndarray | None, optional) – Feature mask for single-branch inputs.

Notes

Each item is a (features, energy) tuple, where energy has shape (1,).

Examples

>>> import numpy as np
>>> from OCDocker.OCScore.DNN.future.datasets import EnergyDataset
>>> features = np.random.rand(100, 20)  # 100 samples, 20 features each
>>> energies = np.random.rand(100)      # 100 energy labels
>>> mask = np.random.randint(0, 2, size=(100, 20))  # Random binary mask
>>> dataset = EnergyDataset(features, energies, mask)
>>> sample_features, sample_energy = dataset[0]
>>> print(sample_features.shape)  # torch.Size([20])
>>> print(sample_energy.shape)    # torch.Size([1])

__getitem__(idx)[source]

Return a dataset sample.

Parameters:

idx (int) – Sample index.

Returns:

Features and energy target tensors.

Return type:

tuple

__init__(features, energies, mask=None)[source]

Initialize energy dataset.

Parameters:
  • features (np.ndarray) – Input features.

  • energies (np.ndarray) – Energy targets.

  • mask (np.ndarray | None, optional) – Feature mask, by default None.

Return type:

None

__len__()[source]

Return dataset length.

Returns:

Number of samples.

Return type:

int

class OCDocker.OCScore.DNN.future.datasets.TargetRankingDataset(*args, **kwargs)[source]

Bases: Dataset

Dataset for ranking with per-target grouping.

Parameters:
  • features (np.ndarray) – Input features.

  • labels (np.ndarray) – Binary labels (1 for active, 0 for decoy).

  • target_ids (Sequence[str]) – Target identifiers per sample (used for grouping).

  • mask (np.ndarray | None, optional) – Feature mask for single-branch inputs.

Notes

Each item is a (features, label, target_id) tuple, where target_id is an integer index. Integer ids are assigned in order of first appearance in target_ids, so the same input sequence always produces the same mapping.
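
The first-appearance ordering in the note above can be illustrated with a small, self-contained sketch. The helper name `encode_target_ids` and the example target names are hypothetical and are not part of OCDocker; this only shows one way such a mapping might be built.

```python
from typing import Dict, List, Sequence, Tuple


def encode_target_ids(target_ids: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map string target ids to integers in order of first appearance.

    Hypothetical helper for illustration; not the OCDocker implementation.
    """
    mapping: Dict[str, int] = {}
    encoded: List[int] = []
    for name in target_ids:
        # A new target gets the next free integer; repeats reuse their id.
        if name not in mapping:
            mapping[name] = len(mapping)
        encoded.append(mapping[name])
    return encoded, mapping


encoded, mapping = encode_target_ids(["egfr", "egfr", "cdk2", "egfr", "ace"])
print(encoded)  # [0, 0, 1, 0, 2]
print(mapping)  # {'egfr': 0, 'cdk2': 1, 'ace': 2}
```

Because ids depend only on the order of the input sequence, reordering the samples before building the dataset would change which integer each target receives.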

__getitem__(idx)[source]

Return a dataset sample.

Parameters:

idx (int) – Sample index.

Returns:

Features, label, and target id.

Return type:

tuple

__init__(features, labels, target_ids, mask=None)[source]

Initialize target ranking dataset.

Parameters:
  • features (np.ndarray) – Input features.

  • labels (np.ndarray) – Binary labels.

  • target_ids (Sequence[str]) – Target identifiers.

  • mask (np.ndarray | None, optional) – Feature mask, by default None.

Return type:

None

__len__()[source]

Return dataset length.

Returns:

Number of samples.

Return type:

int

class OCDocker.OCScore.DNN.future.datasets.TargetBatchSampler(*args, **kwargs)[source]

Bases: List[int]

Sampler that yields batches grouped by target.

Parameters:
  • target_to_indices (dict[int, list[int]]) – Mapping from target id to list of indices.

  • batch_size (int | None, optional) – If provided, caps the number of indices in each per-target batch. If None, each batch contains all indices of one target.

  • shuffle (bool, optional) – Shuffle target order each epoch. Default True.

  • split_target_batches (bool, optional) – If True, split each target into multiple batches of up to batch_size indices (the last batch of a target may be smaller). If False, sample a single batch per target. Default False.

Notes

This sampler groups indices by target id to preserve per-target ranking structure during training.
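
The `target_to_indices` mapping the sampler expects could be assembled from per-sample integer target ids roughly as follows. This is a minimal sketch; the helper name `group_indices_by_target` is an assumption made for illustration and is not an OCDocker function.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_indices_by_target(target_ids: Sequence[int]) -> Dict[int, List[int]]:
    """Collect the sample indices that belong to each integer target id.

    Hypothetical helper for illustration only.
    """
    groups: Dict[int, List[int]] = defaultdict(list)
    for index, target in enumerate(target_ids):
        groups[target].append(index)
    # Return a plain dict so missing keys raise instead of creating entries.
    return dict(groups)


target_to_indices = group_indices_by_target([0, 0, 0, 0, 1, 1, 1, 2, 2])
print(target_to_indices)  # {0: [0, 1, 2, 3], 1: [4, 5, 6], 2: [7, 8]}
```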

Examples

>>> from OCDocker.OCScore.DNN.future.datasets import TargetBatchSampler
>>> target_to_indices = {
...     0: [0, 1, 2, 3],
...     1: [4, 5, 6],
...     2: [7, 8]
... }
>>> sampler = TargetBatchSampler(target_to_indices, batch_size=2, shuffle=False,
...                              split_target_batches=True)
>>> for batch in sampler:
...     print(batch)
[0, 1]
[2, 3]
[4, 5]
[6]
[7, 8]

__init__(target_to_indices, batch_size=None, shuffle=True, split_target_batches=False)[source]

Initialize target batch sampler.

Parameters:
  • target_to_indices (dict[int, list[int]]) – Mapping from target id to indices.

  • batch_size (int | None, optional) – Maximum batch size per target, by default None.

  • shuffle (bool, optional) – Shuffle targets each epoch, by default True.

  • split_target_batches (bool, optional) – Split targets into multiple batches, by default False.

Return type:

None

__iter__()[source]

Yield batches grouped by target.

Yields:

list[int] – Indices for a batch.
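
As a rough illustration of the documented parameters, the iteration could be sketched in plain Python as below. The function `iter_target_batches` and its `rng` argument are hypothetical. The choice to draw the single batch at random without replacement when `split_target_batches` is False is also an assumption, since the documentation only says that a single batch is sampled.

```python
import random
from typing import Dict, Iterator, List, Optional


def iter_target_batches(
    target_to_indices: Dict[int, List[int]],
    batch_size: Optional[int] = None,
    shuffle: bool = True,
    split_target_batches: bool = False,
    rng: Optional[random.Random] = None,
) -> Iterator[List[int]]:
    """Yield index batches that never mix targets (illustrative sketch)."""
    if rng is None:
        rng = random.Random()
    targets = list(target_to_indices)
    if shuffle:
        rng.shuffle(targets)  # shuffle the order in which targets are visited
    for target in targets:
        indices = list(target_to_indices[target])
        if batch_size is None or batch_size >= len(indices):
            yield indices  # the whole target fits in one batch
        elif split_target_batches:
            # Consecutive chunks; the last chunk may be shorter.
            for start in range(0, len(indices), batch_size):
                yield indices[start:start + batch_size]
        else:
            # Assumption: draw batch_size indices without replacement.
            yield rng.sample(indices, batch_size)


batches = list(iter_target_batches(
    {0: [0, 1, 2, 3], 1: [4, 5, 6], 2: [7, 8]},
    batch_size=2, shuffle=False, split_target_batches=True,
))
print(batches)  # [[0, 1], [2, 3], [4, 5], [6], [7, 8]]
```

With `shuffle=False` and `split_target_batches=True`, the sketch yields the same batches as the class-level example.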

__len__()[source]

Return number of batches.

Returns:

Number of batches per epoch.

Return type:

int
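
One plausible way to count batches per epoch is sketched below, assuming that splitting produces ceil(n / batch_size) batches for a target with n indices and that every other configuration produces exactly one batch per target. The function name `count_target_batches` is hypothetical; the actual `__len__` may differ.

```python
import math
from typing import Dict, List, Optional


def count_target_batches(
    target_to_indices: Dict[int, List[int]],
    batch_size: Optional[int] = None,
    split_target_batches: bool = False,
) -> int:
    """Count batches per epoch under the assumptions stated above."""
    if batch_size is None or not split_target_batches:
        # One batch per target.
        return len(target_to_indices)
    # Each target contributes ceil(n / batch_size) chunks.
    return sum(
        math.ceil(len(indices) / batch_size)
        for indices in target_to_indices.values()
    )


groups = {0: [0, 1, 2, 3], 1: [4, 5, 6], 2: [7, 8]}
n_split = count_target_batches(groups, batch_size=2, split_target_batches=True)
n_single = count_target_batches(groups, batch_size=2)
print(n_split)   # 5
print(n_single)  # 3
```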