OCDocker.OCScore.Utils.Data module¶
Set of functions to manage data processing in OCDocker in the context of scoring functions.
Usage:
import OCDocker.OCScore.Utils.Data as ocscoredata
- OCDocker.OCScore.Utils.Data.apply_pca(df, pca_model, columns_to_skip_pca=[], inplace=False)[source]¶
Applies PCA to a DataFrame using a pre-trained PCA model.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
pca_model (str) – Path to the pre-trained PCA model or the PCA model.
columns_to_skip_pca (list[str], optional) – List of columns to keep in the DataFrame before applying PCA. The default is [].
inplace (bool, optional) – If True, the original DataFrame is modified. If False, a new DataFrame is returned. The default is False.
- Returns:
DataFrame with PCA applied if inplace is False. None if inplace is True.
- Return type:
pd.DataFrame or None
- Raises:
FileNotFoundError – If the PCA model path is not found.
TypeError – If the PCA model type is invalid. Must be a string or a PCA model.
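A minimal sketch of how a pre-fitted PCA model can be applied while preserving skipped columns, in the spirit of apply_pca. The helper name `apply_pca_sketch` and the column names are illustrative, not the actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def apply_pca_sketch(df, pca_model, columns_to_skip_pca=[]):
    # Keep the skipped columns aside, transform the rest with the fitted model.
    kept = df[columns_to_skip_pca]
    features = df.drop(columns=columns_to_skip_pca)
    transformed = pca_model.transform(features)
    pca_df = pd.DataFrame(
        transformed,
        columns=[f"PC{i + 1}" for i in range(transformed.shape[1])],
        index=df.index,
    )
    return pd.concat([kept, pca_df], axis=1)

# Fit a tiny PCA model on dummy data (illustrative only).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
model = PCA(n_components=2).fit(train)

df = pd.DataFrame(rng.normal(size=(5, 4)), columns=list("abcd"))
df["name"] = list("vwxyz")
out = apply_pca_sketch(df, model, columns_to_skip_pca=["name"])
```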
- OCDocker.OCScore.Utils.Data.calculate_metrics(df, selected_columns)[source]¶
Calculates additional metrics for a DataFrame. The metrics include average, median, maximum, minimum, standard deviation, variance, sum, range, 25th and 75th percentiles.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
selected_columns (list) – List of columns to calculate metrics for.
- Returns:
pd.DataFrame – DataFrame with additional metrics.
list – List of additional metrics column names.
- Return type:
tuple[DataFrame, list]
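The listed metrics can be sketched as row-wise aggregates over the selected columns. This assumes per-row aggregation and illustrative metric column names; the actual names may differ:

```python
import pandas as pd

def calculate_metrics_sketch(df, selected_columns):
    # Row-wise aggregates over the selected score columns.
    sub = df[selected_columns]
    metrics = {
        "avg": sub.mean(axis=1),
        "median": sub.median(axis=1),
        "max": sub.max(axis=1),
        "min": sub.min(axis=1),
        "std": sub.std(axis=1),
        "var": sub.var(axis=1),
        "sum": sub.sum(axis=1),
        "range": sub.max(axis=1) - sub.min(axis=1),
        "q25": sub.quantile(0.25, axis=1),
        "q75": sub.quantile(0.75, axis=1),
    }
    out = df.copy()
    for name, series in metrics.items():
        out[name] = series
    return out, list(metrics)

df = pd.DataFrame({"VINA_1": [1.0, 2.0], "VINA_2": [3.0, 6.0]})
result, new_cols = calculate_metrics_sketch(df, ["VINA_1", "VINA_2"])
```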
- OCDocker.OCScore.Utils.Data.chunkenize_dataset(data, id, num_machines)[source]¶
Splits a dataset into multiple chunks and returns the chunk assigned to the given machine.
- Parameters:
data (list[Any] | np.ndarray | pd.DataFrame) – The dataset to split (can be a list, numpy array, or pandas DataFrame).
id (int) – The ID of the current machine (1-based index).
num_machines (int) – The total number of machines (integer).
- Returns:
A subset of the data that corresponds to the given id.
- Return type:
list[Any] | np.ndarray | pd.DataFrame
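The splitting behavior can be sketched with numpy's near-equal partitioning; the 1-based id selects the chunk. A sketch, assuming even splitting (the real function may chunk differently):

```python
import numpy as np

def chunkenize_sketch(data, id, num_machines):
    # Split into num_machines near-equal chunks; id is a 1-based index.
    chunks = np.array_split(data, num_machines)
    return chunks[id - 1]

data = list(range(10))
chunk = chunkenize_sketch(data, id=2, num_machines=3)
```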
- OCDocker.OCScore.Utils.Data.compute_zscore(df, columns)[source]¶
Computes the z-score for the specified columns in a DataFrame.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
columns (list) – List of columns to compute the z-score for.
- Returns:
DataFrame with z-score values for the specified columns.
- Return type:
pd.DataFrame
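The z-score computation follows the standard definition, (x - mean) / std, applied column-wise. A minimal sketch (the helper name is illustrative):

```python
import pandas as pd

def compute_zscore_sketch(df, columns):
    # Column-wise z-score: (x - mean) / std.
    sub = df[columns]
    return (sub - sub.mean()) / sub.std()

df = pd.DataFrame({"score": [1.0, 2.0, 3.0], "name": list("abc")})
z = compute_zscore_sketch(df, ["score"])
```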
- OCDocker.OCScore.Utils.Data.detect_extreme_outliers_iqr_columns_positive(df, columns, extreme_factor=3.0)[source]¶
Detects extreme outliers in specified columns of a DataFrame using the IQR method.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
columns (list[str]) – List of columns to check for extreme outliers.
extreme_factor (float, optional) – The factor to determine extreme outliers. The default is 3.0.
- Returns:
Dictionary containing the extreme outliers for each specified column.
- Return type:
dict
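The IQR method flags values outside [Q1 - f*IQR, Q3 + f*IQR]. A sketch, assuming both bounds are checked (the "_positive" variant may apply additional sign constraints not shown here):

```python
import pandas as pd

def detect_extreme_outliers_sketch(df, columns, extreme_factor=3.0):
    # Flag values outside [Q1 - f*IQR, Q3 + f*IQR] per column.
    outliers = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        low, high = q1 - extreme_factor * iqr, q3 + extreme_factor * iqr
        mask = (df[col] < low) | (df[col] > high)
        outliers[col] = df.loc[mask, col]
    return outliers

df = pd.DataFrame({"PLANTS_SCORE": [1, 2, 2, 3, 2, 100]})
found = detect_extreme_outliers_sketch(df, ["PLANTS_SCORE"])
```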
- OCDocker.OCScore.Utils.Data.generate_mask(column_names, score_columns)[source]¶
Generates masks with combinations of 0s and 1s for columns that match a regex pattern. Columns that don’t match the regex are filled with 1s.
- Parameters:
column_names (list[str] | pd.Index) – A list of strings, pandas series or pandas index representing column names.
score_columns (list[str]) – Column names that should have combinations of 0s and 1s.
- Returns:
A list of numpy arrays, where columns matching the regex pattern have combinations of 0s and 1s, and columns that don’t match are filled with 1s.
- Return type:
list[np.ndarray]
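Mask generation amounts to enumerating every 0/1 combination over the score columns while pinning all other columns to 1. A sketch using simple name membership in place of the regex matching the docstring describes:

```python
import itertools
import numpy as np

def generate_mask_sketch(column_names, score_columns):
    # Score columns get every 0/1 combination; all others stay 1.
    score_idx = [i for i, c in enumerate(column_names) if c in score_columns]
    masks = []
    for combo in itertools.product([0, 1], repeat=len(score_idx)):
        mask = np.ones(len(column_names), dtype=int)
        for i, bit in zip(score_idx, combo):
            mask[i] = bit
        masks.append(mask)
    return masks

cols = ["VINA", "SMINA", "mw"]
masks = generate_mask_sketch(cols, ["VINA", "SMINA"])
```

With two score columns this yields 2² = 4 masks; the non-score column "mw" is always 1.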
- OCDocker.OCScore.Utils.Data.get_column_order(data=None)[source]¶
Get the column order from a data source (file path or DataFrame) or from config.
This function extracts the column order from either a file path, an existing DataFrame, or from the config file if no data source is provided. This ensures consistency with the order used during model training. This is critical for proper mask application and feature alignment.
- Parameters:
data (str | pd.DataFrame | None, optional) – Either:
- A file path (CSV or gzipped CSV) to load column order from
- A pandas DataFrame to extract column order from
- None to use the column order from config (default: None)
- Returns:
List of column names in the exact order they appear in the data source or config.
- Return type:
list[str]
- Raises:
FileNotFoundError – If data is a string path and the file is not found.
TypeError – If data is neither a string, DataFrame, nor None.
ValueError – If data is None and config does not have reference_column_order set.
- OCDocker.OCScore.Utils.Data.invert_values_conditionally(df, regex_pattern='^(VINA|SMINA|PLANTS).*|^experimental$', inplace=False)[source]¶
Inverts the values of specific columns in a DataFrame. The inversion is applied to columns that start with ‘VINA’, ‘SMINA’, or ‘PLANTS’ as well as the column named ‘experimental’.
This function multiplies the values in these columns by -1, effectively inverting them. It’s particularly useful in scenarios where the sign of these values needs to be reversed for analysis or data processing.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
regex_pattern (str) – Regular expression pattern to match the columns to invert. The default pattern matches columns that start with ‘VINA’, ‘SMINA’, or ‘PLANTS’, as well as the column named ‘experimental’. (r”^(VINA|SMINA|PLANTS).*|^experimental$”)
inplace (bool) – If True, the original DataFrame is modified. If False, a new DataFrame is returned.
- Returns:
DataFrame with the matching columns inverted. If inplace is False, the original DataFrame is left unmodified.
- Return type:
pd.DataFrame
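The conditional inversion can be sketched with pandas' regex-based column filter. A minimal sketch of the non-inplace path (the helper name is illustrative):

```python
import pandas as pd

def invert_sketch(df, regex_pattern=r"^(VINA|SMINA|PLANTS).*|^experimental$"):
    # Multiply matching columns by -1; everything else is untouched.
    out = df.copy()
    cols = out.filter(regex=regex_pattern).columns
    out[cols] = out[cols] * -1
    return out

df = pd.DataFrame({"VINA_score": [1.0], "experimental": [5.0], "mw": [300.0]})
inverted = invert_sketch(df)
```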
- OCDocker.OCScore.Utils.Data.load_data(base_models_folder, storage_id, df_path, optimization_type, pca_model='', no_scores=False, only_scores=False, use_PCA=False, pca_type=95, use_pdb_train=True, random_seed=42, invert_conditionally=True, normalize=True, scaler='standard', enforce_reference_order=False)[source]¶
Process the data for training and testing the models.
- Parameters:
base_models_folder (str) – The base folder to store the models.
storage_id (int) – The storage ID for the models.
df_path (str) – The path to the DataFrame file.
optimization_type (str) – The optimization type.
pca_model (str | PCA, optional) – The PCA model or the path to the PCA model. The default is “”.
no_scores (bool, optional) – If True, no scores are used. The default is False. (Will override only_scores)
only_scores (bool, optional) – If True, only the score columns are used. The default is False.
use_PCA (bool, optional) – If True, PCA is applied to the data. The default is False.
pca_type (str | int, optional) – The PCA type. The default is 95. Options are “95”, “90”, “85”, and “80” (as strings or integers).
use_pdb_train (bool, optional) – If True, the PDBbind data is used for training. The default is True.
random_seed (int, optional) – The random seed for splitting the data. The default is 42.
invert_conditionally (bool, optional) – If True, invert score-like columns conditionally during preprocessing. The default is True.
normalize (bool, optional) – If True, normalize data during preprocessing. The default is True.
scaler (str, optional) – Scaler used when normalize=True. Options are “standard” and “minmax”. The default is “standard”.
enforce_reference_order (bool, optional) – If True, reorder columns using reference_column_order from config before the split, ensuring stable feature/mask alignment. The default is False.
- Returns:
Dictionary containing the processed data. The keys are:
- models_folder: The models folder.
- study_name: The study name.
- X_train: The training input features.
- X_test: The testing input features.
- y_train: The training target variable.
- y_test: The testing target variable.
- X_val: The validation input features.
- y_val: The validation target variable.
- Return type:
dict
- OCDocker.OCScore.Utils.Data.norm_data(df, scaler='standard', inplace=False)[source]¶
Preprocesses the input DataFrame by scaling selected feature columns using a Scaler. The metadata columns (“receptor”, “ligand”, “name”, “type”, “db”) and target variable (“experimental”) are preserved and excluded from scaling.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
scaler (str | StandardScaler | MinMaxScaler) – Scaler to use. Options are:
- “standard” or “minmax”: Creates and fits a new scaler
- StandardScaler or MinMaxScaler object: Uses the provided pre-fitted scaler
inplace (bool) – If True, the original DataFrame is modified. If False, a new DataFrame is returned.
- Returns:
DataFrame with normalized features while preserving metadata and target variable. If scaler is a string (new scaler), returns tuple of (DataFrame, fitted_scaler) if inplace=False, or just DataFrame if inplace=True. If scaler is a pre-fitted object, returns only the DataFrame.
- Return type:
pd.DataFrame | tuple[pd.DataFrame, Union[StandardScaler, MinMaxScaler]]
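The "scale features, pass metadata through" behavior can be sketched with scikit-learn's StandardScaler. This assumes the metadata/target column names listed above; only the “standard” string path is shown, and the helper name is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

META = ["receptor", "ligand", "name", "type", "db", "experimental"]

def norm_data_sketch(df, scaler="standard"):
    # Scale feature columns only; metadata and target are passed through.
    features = [c for c in df.columns if c not in META]
    if scaler == "standard":
        scaler = StandardScaler().fit(df[features])
        fitted_new = True
    else:
        fitted_new = False  # assume a pre-fitted scaler object was passed
    out = df.copy()
    out[features] = scaler.transform(df[features])
    return (out, scaler) if fitted_new else out

df = pd.DataFrame({"name": ["a", "b", "c"],
                   "experimental": [1.0, 2.0, 3.0],
                   "VINA": [10.0, 20.0, 30.0]})
normed, fitted = norm_data_sketch(df)
```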
- OCDocker.OCScore.Utils.Data.preprocess_df(file_name: str, score_columns_list: list[str] = ['SMINA', 'VINA', 'ODDT', 'PLANTS'], outliers_columns_list: list[str] | None = None, scaler: str = 'standard', invert_conditionally: bool = True, normalize: bool = True, return_scaler: Literal[False] = False) tuple[DataFrame, DataFrame, list[str]][source]¶
- OCDocker.OCScore.Utils.Data.preprocess_df(file_name: str, score_columns_list: list[str] = ['SMINA', 'VINA', 'ODDT', 'PLANTS'], outliers_columns_list: list[str] | None = None, scaler: str = 'standard', invert_conditionally: bool = True, normalize: bool = True, return_scaler: Literal[True] = True) tuple[DataFrame, DataFrame, list[str], StandardScaler | MinMaxScaler]
Load a DataFrame from a file and preprocess it.
- Parameters:
file_name (str) – The name of the file to load the DataFrame from.
score_columns_list (list[str], optional) – The list of columns to be considered as score columns. The default is [“SMINA”, “VINA”, “ODDT”, “PLANTS”].
outliers_columns_list (list[str], optional) – The list of columns to analyze for outliers. If None, defaults to ‘PLANTS’ columns. The default is None.
scaler (str, optional) – The scaler to use. The default is “standard”. Options are “standard” and “minmax”.
invert_conditionally (bool, optional) – If True, the values in the score columns are inverted conditionally. The default is True.
normalize (bool, optional) – If True, the data is normalized. The default is True.
return_scaler (bool, optional) – If True, returns the fitted scaler along with the data. The scaler is fitted on PDBbind data (training data) and used to transform both datasets. Default is False.
- Returns:
If return_scaler is False: (dudez_data, pdbbind_data, score_columns)
If return_scaler is True: (dudez_data, pdbbind_data, score_columns, fitted_scaler)
- Return type:
tuple
- OCDocker.OCScore.Utils.Data.remove_extreme_outliers_iqr_columns_positive(df, columns, extreme_factor=3.0)[source]¶
Removes rows with extreme outliers in specified columns of a DataFrame using the IQR method.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
columns (list[str]) – List of columns to check for extreme outliers.
extreme_factor (float, optional) – The factor to determine extreme outliers. The default is 3.0.
- Returns:
DataFrame with rows containing extreme outliers removed.
- Return type:
pd.DataFrame
- OCDocker.OCScore.Utils.Data.remove_other_columns(df, columns_to_keep, inplace=False)[source]¶
Removes columns from a DataFrame that are not in the specified list.
- Parameters:
df (pd.DataFrame) – Input DataFrame.
columns_to_keep (list) – List of columns to keep.
inplace (bool) – If True, the original DataFrame is modified. If False, a new DataFrame is returned.
- Returns:
DataFrame with only the specified columns.
- Return type:
pd.DataFrame
- OCDocker.OCScore.Utils.Data.reorder_columns_to_match_data_order(df, data_source=None, keep_extra_columns=True, fill_missing_columns=False)[source]¶
Reorder DataFrame columns to match the column order from another data source.
!!! CRITICAL: This function ensures that all columns are in the exact same order as the data source, which is essential for proper mask application and model inference. The order of scoring functions (SFs) is particularly important for masks.
This is typically used to ensure prediction data has the same column order as the training data, ensuring masks and models work correctly.
- Parameters:
df (pd.DataFrame) – Input DataFrame to reorder.
data_source (str | pd.DataFrame | None, optional) – Data source to match column order from. Either:
- A file path (CSV or gzipped CSV) to load column order from
- A pandas DataFrame to extract column order from
- None to use reference_column_order from config (default: None)
keep_extra_columns (bool, optional) – If True, columns not in data_source are kept at the end (default: True). If False, extra columns are dropped.
fill_missing_columns (bool, optional) – If True, missing columns from data_source are added as NaN (default: False). If False, missing columns are simply not included.
- Returns:
DataFrame with columns reordered to match data_source column order.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If data_source is a string path and the file is not found.
TypeError – If data_source is neither a string nor a DataFrame.
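The reordering logic can be sketched with pandas reindex. This sketch takes an explicit reference column list in place of the file/config resolution, so the signature is illustrative:

```python
import pandas as pd

def reorder_sketch(df, reference_order,
                   keep_extra_columns=True, fill_missing_columns=False):
    # Reference columns first, in reference order; extras (optionally) at the end.
    if fill_missing_columns:
        target = list(reference_order)  # reindex fills missing columns with NaN
    else:
        target = [c for c in reference_order if c in df.columns]
    extras = ([c for c in df.columns if c not in reference_order]
              if keep_extra_columns else [])
    return df.reindex(columns=target + extras)

df = pd.DataFrame({"b": [1], "a": [2], "extra": [3]})
out = reorder_sketch(df, ["a", "b", "c"])
```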
- OCDocker.OCScore.Utils.Data.split_dataset(X, y, test_size=0.2, random_state=42)[source]¶
Split the data into training and testing sets.
- Parameters:
X (pd.DataFrame) – The input features.
y (pd.Series) – The target variable.
test_size (float, optional) – The proportion of the dataset to include in the test split. The default is 0.2.
random_state (int, optional) – The seed used by the random number generator. The default is 42.
- Returns:
X_train (pd.DataFrame) – The training input features.
X_test (pd.DataFrame) – The testing input features.
y_train (pd.Series) – The training target variable.
y_test (pd.Series) – The testing target variable.
- Return type:
tuple[Any, Any, Any, Any]
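The split can be sketched as a shuffled hold-out partition; the actual function may simply wrap scikit-learn's train_test_split, so this numpy version is illustrative only:

```python
import numpy as np
import pandas as pd

def split_dataset_sketch(X, y, test_size=0.2, random_state=42):
    # Shuffled hold-out split: permute indices, carve off the test fraction.
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return (X.iloc[train_idx], X.iloc[test_idx],
            y.iloc[train_idx], y.iloc[test_idx])

X = pd.DataFrame({"f": range(10)})
y = pd.Series(range(10))
X_train, X_test, y_train, y_test = split_dataset_sketch(X, y)
```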