OCDocker.Processing.Preprocessing.RmsdClustering module

Classes and functions for clustering molecules based on their root-mean-square deviation (RMSD).

Usage:

import OCDocker.Processing.Preprocessing.RmsdClustering as ocrmsdclust

OCDocker.Processing.Preprocessing.RmsdClustering.cluster_rmsd(data, algorithm='agglomerativeClustering', max_distance_threshold=20.0, min_distance_threshold=10.0, threshold_step=0.1, outputPlot='', molecule_name='', pose_engine_map=None, engine_colors=None)

Cluster molecules based on their RMSD.

Parameters:

data (Union[Dict[str, Dict[str, float]], pd.DataFrame]) – The pairwise RMSD matrix.
algorithm (str, optional) – The clustering algorithm to use. The default is ‘agglomerativeClustering’, which is currently the only listed option.
min_distance_threshold (float, optional) – The minimum distance threshold for the agglomerative clustering. The default is 10.0.
max_distance_threshold (float, optional) – The maximum distance threshold for the agglomerative clustering. The default is 20.0.
threshold_step (float, optional) – The step to perform the distance threshold search. The default is 0.1.
outputPlot (str, optional) – The path to the output plot. The default is “”. If it is “”, the plot is not saved.
molecule_name (str, optional) – The name of the molecule to include in the plot title. The default is “”.
pose_engine_map (Dict[str, str], optional) – Mapping from pose file paths to engine names (‘vina’, ‘smina’, ‘plants’). Used for coloring labels in the plot.
engine_colors (Dict[str, str], optional) – Mapping from engine names to colors. Default: {‘plants’: ‘green’, ‘vina’: ‘#9B59B6’, ‘smina’: ‘blue’}. Engine names should be lowercase.

Returns:

The clusters, or an error code. IMPORTANT: error code 751 means that the clustering could not find any consensus among the poses, i.e. the poses are too different from each other. In this case, the poses should be discarded.

Return type:

np.ndarray | int
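
Because the return value is either an array of clusters or an integer error code, callers need to check the type before using it. The following is a minimal sketch of such a check, not part of OCDocker: the helper name labels_or_none and the constant NO_CONSENSUS are hypothetical, and it assumes the error code arrives as a plain Python int.

```python
from typing import List, Optional, Sequence, Union

# Error code documented above: no consensus among the poses.
NO_CONSENSUS = 751


def labels_or_none(result: Union[int, Sequence[int]]) -> Optional[List[int]]:
    """Return the clusters as a list, or None if `result` is an error code.

    `result` stands in for the value returned by cluster_rmsd.
    This helper is hypothetical and only illustrates the type check.
    """
    if isinstance(result, int):
        # Any integer return is treated as an error code here;
        # NO_CONSENSUS (751) means the poses should be discarded.
        return None
    return list(result)


print(labels_or_none(NO_CONSENSUS))  # error code: nothing to use
print(labels_or_none([0, 0, 1]))     # clusters: usable downstream
```

In real use, the argument would be the result of ocrmsdclust.cluster_rmsd(data), and a None result would signal that the poses should be skipped.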

OCDocker.Processing.Preprocessing.RmsdClustering.get_medoids(data, clusters, onlyBiggest=True)

Get the medoids of the clusters.

Parameters:

data (Union[Dict[str, Dict[str, float]], pd.DataFrame]) – The pairwise RMSD matrix.
clusters (np.ndarray) – The clusters.
onlyBiggest (bool, optional) – If True, only the medoids of the biggest clusters are returned. The default is True.

Returns:

The paths to the medoids.

Return type:

List[str]
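
To illustrate the idea of a medoid, the sketch below picks, for each cluster, the pose with the smallest summed RMSD to the other members of that cluster. It is a standalone illustration under stated assumptions, not OCDocker's implementation: the function name medoids_sketch is hypothetical, the i-th cluster label is assumed to belong to the i-th key of the nested dictionary, clusters tied for the largest size are all kept, and the RMSD values are invented.

```python
from collections import Counter
from typing import Dict, List, Sequence


def medoids_sketch(
    rmsd: Dict[str, Dict[str, float]],
    labels: Sequence[int],
    only_biggest: bool = True,
) -> List[str]:
    """Per cluster, return the pose with the smallest summed RMSD to its cluster mates."""
    poses = list(rmsd)  # assumption: labels[i] belongs to the i-th key of `rmsd`
    members: Dict[int, List[str]] = {}
    for pose, label in zip(poses, labels):
        members.setdefault(label, []).append(pose)

    if only_biggest:
        # Keep every cluster tied for the largest size (an assumption).
        sizes = Counter(labels)
        biggest = max(sizes.values())
        members = {k: v for k, v in members.items() if sizes[k] == biggest}

    return [
        min(group, key=lambda p: sum(rmsd[p][q] for q in group))
        for group in members.values()
    ]


# Toy symmetric RMSD matrix; file names and values are invented for illustration.
rmsd = {
    "a.pdbqt": {"a.pdbqt": 0.0, "b.pdbqt": 1.0, "c.pdbqt": 1.5, "d.pdbqt": 9.0},
    "b.pdbqt": {"a.pdbqt": 1.0, "b.pdbqt": 0.0, "c.pdbqt": 1.0, "d.pdbqt": 8.0},
    "c.pdbqt": {"a.pdbqt": 1.5, "b.pdbqt": 1.0, "c.pdbqt": 0.0, "d.pdbqt": 8.5},
    "d.pdbqt": {"a.pdbqt": 9.0, "b.pdbqt": 8.0, "c.pdbqt": 8.5, "d.pdbqt": 0.0},
}
print(medoids_sketch(rmsd, [0, 0, 0, 1]))  # → ['b.pdbqt']
```

With only_biggest=False, the same call would also return the single member of the second cluster.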