OCDocker.OCScore.Dimensionality.GeneticAlgorithm module

Module to perform the feature selection using the Genetic Algorithm.

It is imported as:

from OCDocker.OCScore.Dimensionality.GeneticAlgorithm import GeneticAlgorithm

class OCDocker.OCScore.Dimensionality.GeneticAlgorithm.GeneticAlgorithm(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)

Bases: object

A class to optimize the feature selection for XGBoost using a genetic algorithm.

Parameters:

X_train (ndarray | DataFrame | Series) –
y_train (ndarray | DataFrame | Series) –
X_test (ndarray | DataFrame | Series) –
y_test (ndarray | DataFrame | Series) –
xgboost_params (dict) –
X_validation (None | ndarray | DataFrame | Series) –
y_validation (None | ndarray | DataFrame | Series) –
storage (str) –
evolution_params (dict | None) –
use_gpu (bool) –
early_stopping_rounds (int) –
random_state (int) –
fixed_features_index (list | None) –
verbose (bool) –

__init__(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)

Constructor for the GeneticAlgorithm class.

Parameters:

X_train (np.ndarray | pd.DataFrame | pd.Series) – The full training dataset.
y_train (np.ndarray | pd.DataFrame | pd.Series) – The training labels.
X_test (np.ndarray | pd.DataFrame | pd.Series) – The full test dataset.
y_test (np.ndarray | pd.DataFrame | pd.Series) – The test labels.
xgboost_params (dict) – The hyperparameters for the XGBoost model.
X_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation dataset. Default is None.
y_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation labels. Default is None.
evolution_params (dict, None, optional) – The hyperparameters for the genetic algorithm. Default is None.
use_gpu (bool, optional) – Whether to use the GPU for training the XGBoost model. Default is False.
random_state (int, optional) – The random state for the XGBoost model. Default is 42.
fixed_features_index (list, None, optional) – The indexes of the scores to be used for the evaluation. Default is None.
storage (str, optional) – Default is 'sqlite:///GA.db'.
early_stopping_rounds (int, optional) – Default is 20.
verbose (bool, optional) – Default is False.

Return type:

None

crossover(parent1, parent2)

A function to perform crossover for the genetic algorithm.

Parameters:

parent1 (np.ndarray) – The first parent.
parent2 (np.ndarray) – The second parent.

Returns:

The child individual.

Return type:

np.ndarray
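
The reference above does not say which crossover operator is used. As a minimal sketch, assuming single-point crossover on binary feature masks, the operation could look like the following. The function name single_point_crossover and the rng argument are hypothetical and not part of OCDocker.

```python
import numpy as np

def single_point_crossover(parent1, parent2, rng):
    """Combine two parents at one random cut point (assumed operator)."""
    # Pick a cut point strictly inside the chromosome so both parents contribute.
    point = rng.integers(1, len(parent1))
    # Genes before the cut come from parent1, the rest from parent2.
    return np.concatenate([parent1[:point], parent2[point:]])

rng = np.random.default_rng(0)
parent1 = np.zeros(6, dtype=int)
parent2 = np.ones(6, dtype=int)
child = single_point_crossover(parent1, parent2, rng)
print(child)
```

With an all-zeros and an all-ones parent, the child is a run of zeros followed by a run of ones, which makes the cut point easy to see.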

fitness(individual)

A function to calculate the fitness of a set of features represented by an individual.

Parameters:

individual (list) – A binary list representing the inclusion (1) or exclusion (0) of each feature.

Returns:

The metric score of the selected features and the model.

Return type:

tuple
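
To illustrate how a binary individual maps to a score, here is a hedged sketch. The class trains an XGBoost model; to keep the example dependency-free, an ordinary least-squares fit stands in for it, and negative RMSE is an assumed metric (the reference does not name the metric). The name fitness_sketch and its extra data arguments are hypothetical.

```python
import numpy as np

def fitness_sketch(individual, X_train, y_train, X_test, y_test):
    """Score a feature mask; returns (score, model) like the documented tuple."""
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():
        # No features selected: nothing to fit, so return the worst score.
        return -np.inf, None
    # Stand-in model (assumption): least squares with an intercept column.
    A_train = np.column_stack([X_train[:, mask], np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test[:, mask], np.ones(len(X_test))])
    rmse = float(np.sqrt(np.mean((A_test @ coef - y_test) ** 2)))
    # Negate so that a higher score means a better feature subset.
    return -rmse, coef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=200)  # only feature 0 matters
good, _ = fitness_sketch([1, 0, 0, 0], X[:150], y[:150], X[150:], y[150:])
bad, _ = fitness_sketch([0, 1, 1, 1], X[:150], y[:150], X[150:], y[150:])
empty, _ = fitness_sketch([0, 0, 0, 0], X[:150], y[:150], X[150:], y[150:])
print(good > bad)
```

The mask that keeps the informative feature scores higher than the mask that drops it, which is the signal the genetic algorithm exploits.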

genetic_algorithm(trial_params, trial)

A function to perform the genetic algorithm for feature selection.

Parameters:

number_of_generations (int) – The number of generations.
population_size (int) – The size of the population.
mutation_rate (float) – The mutation rate.
trial_params (dict) –
trial (Any) –

Returns:

np.ndarray – The selected features.
XGBRegressor – The model.
float – The score of the selected features.
Union[None, float] – The AUC score of the selected features. If the validation dataset is not provided, None is returned.

Return type:

tuple[ndarray, XGBRegressor, float, None | float]
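
The overall loop can be sketched as follows. This is an illustration under stated assumptions, not the OCDocker implementation: it takes number_of_generations, population_size, and mutation_rate as direct arguments, uses tournament selection of size 3, single-point crossover, bit-flip mutation, and elitism, and returns only the best mask and its score rather than the documented four-element tuple. The names run_ga_sketch, fitness_fn, and toy_fitness are hypothetical.

```python
import numpy as np

def run_ga_sketch(fitness_fn, number_of_features, number_of_generations=30,
                  population_size=20, mutation_rate=0.05, seed=42):
    """Evolve binary feature masks and return (best_mask, best_score)."""
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, size=(population_size, number_of_features))
    best, best_score = None, -np.inf
    for generation in range(number_of_generations):
        scores = np.array([fitness_fn(ind) for ind in population])
        top = int(np.argmax(scores))
        if scores[top] > best_score:
            best, best_score = population[top].copy(), float(scores[top])
        children = [best.copy()]  # elitism: carry the best individual forward
        while len(children) < population_size:
            parents = []
            for _ in range(2):
                # Tournament: sample 3 contenders, keep the fittest.
                idx = rng.choice(population_size, size=3, replace=False)
                parents.append(population[idx[np.argmax(scores[idx])]])
            # Single-point crossover of the two selected parents.
            point = rng.integers(1, number_of_features)
            child = np.concatenate([parents[0][:point], parents[1][point:]])
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            flip = rng.random(number_of_features) < mutation_rate
            children.append(np.where(flip, 1 - child, child))
        population = np.array(children)
    return best, best_score

# Toy fitness (assumption): number of genes that match a hidden target mask.
target = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

def toy_fitness(individual):
    return int((individual == target).sum())

best, best_score = run_ga_sketch(toy_fitness, number_of_features=len(target))
print(best, best_score)
```

In the real class the fitness call would train and score an XGBoost model on the selected columns, so each generation is far more expensive than in this toy run.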

initialize_population(number_of_features, population_size)

A function to initialize the population for the genetic algorithm.

Parameters:

number_of_features (int) – The number of features in the dataset.
population_size (int) – The size of the population.

Returns:

The initialized population.

Return type:

np.ndarray
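
A minimal sketch of what such an initialization might look like, assuming each gene is drawn uniformly from {0, 1}. The actual method may differ, for example in how it treats fixed_features_index. The name initialize_population_sketch and the seed argument are hypothetical.

```python
import numpy as np

def initialize_population_sketch(number_of_features, population_size, seed=42):
    """Return a (population_size, number_of_features) array of random 0/1 genes."""
    rng = np.random.default_rng(seed)
    # integers(0, 2) draws from {0, 1}; the upper bound is exclusive.
    return rng.integers(0, 2, size=(population_size, number_of_features))

population = initialize_population_sketch(number_of_features=8, population_size=5)
print(population.shape)
```

Each row is one individual, and each column says whether the corresponding feature is included (1) or excluded (0).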

mutation(individual, mutation_rate=0.05)

A function to perform mutation for the genetic algorithm.

Parameters:

individual (np.ndarray) – The individual to be mutated.
mutation_rate (float, optional) – The mutation rate. Default is 0.05.

Returns:

The mutated individual.

Return type:

np.ndarray
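
The reference does not specify the mutation operator. A common choice for binary masks is bit-flip mutation, sketched below under that assumption: each gene is flipped independently with probability mutation_rate. The name bit_flip_mutation and the rng argument are hypothetical.

```python
import numpy as np

def bit_flip_mutation(individual, mutation_rate=0.05, rng=None):
    """Flip each 0/1 gene independently with probability mutation_rate."""
    rng = np.random.default_rng() if rng is None else rng
    # rng.random draws from [0, 1), so a rate of 0 never flips and 1 always flips.
    flip = rng.random(individual.shape) < mutation_rate
    # Where `flip` is True, 0 becomes 1 and 1 becomes 0; other genes are kept.
    return np.where(flip, 1 - individual, individual)

individual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
unchanged = bit_flip_mutation(individual, mutation_rate=0.0)
inverted = bit_flip_mutation(individual, mutation_rate=1.0)
print(unchanged, inverted)
```

The two boundary rates make the behavior easy to check: a rate of 0.0 returns the individual unchanged, and a rate of 1.0 inverts every gene.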

objective(trial)

The objective function for the Optuna optimization.

Parameters:

trial (optuna.Trial) – The trial object.

Returns:

The AUC score of the selected features.

Return type:

float

optimize(direction='maximize', n_trials=100, n_jobs=1, study_name='Genetic Algorithm for descriptor optimization', load_if_exists=True, verbose=False)

A function to optimize the genetic-algorithm feature selection with Optuna.

Parameters:

direction (str, optional) – The direction of the optimization. Default is “maximize”.
n_trials (int, optional) – The number of trials. Default is 100.
n_jobs (int, optional) – The number of jobs to run in parallel. Default is 1.
study_name (str, optional) – The name of the study. Default is “Genetic Algorithm for descriptor optimization”.
load_if_exists (bool, optional) – Whether to load the study if it exists. Default is True.
verbose (bool, optional) – Whether to print verbose output. Default is False.

Returns:

optuna.study.Study – The Optuna study object.
dict – The best hyperparameters.
float – The best score.

Return type:

tuple[optuna.study.Study, dict, float]

tournament_selection(population, fitnesses, tournament_size=3)

A function to perform tournament selection for the genetic algorithm.

Parameters:

population (np.ndarray) – The current population.
fitnesses (np.ndarray) – The fitness scores of the population.
tournament_size (int, optional) – The size of the tournament. Default is 3.

Returns:

The selected individual.

Return type:

np.ndarray
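
Tournament selection can be sketched as follows, assuming that higher fitness is better and that contenders are sampled without replacement; neither detail is stated in the reference. The name tournament_selection_sketch and the rng argument are hypothetical.

```python
import numpy as np

def tournament_selection_sketch(population, fitnesses, tournament_size=3, rng=None):
    """Sample tournament_size individuals and return the fittest of them."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw distinct row indices to take part in the tournament.
    contenders = rng.choice(len(population), size=tournament_size, replace=False)
    # The contender with the highest fitness wins.
    winner = contenders[np.argmax(fitnesses[contenders])]
    return population[winner]

population = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
fitnesses = np.array([0.1, 0.4, 0.2, 0.9])
# With tournament_size equal to the population size, the global best always wins.
selected = tournament_selection_sketch(
    population, fitnesses, tournament_size=4, rng=np.random.default_rng(0)
)
print(selected)
```

A smaller tournament_size weakens selection pressure, since the best individual is less likely to be among the contenders in any given draw.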