OCDocker.OCScore.Dimensionality.GeneticAlgorithm module

Module to perform feature selection using a genetic algorithm.

It is imported as:

from OCDocker.OCScore.Dimensionality.GeneticAlgorithm import GeneticAlgorithm

class OCDocker.OCScore.Dimensionality.GeneticAlgorithm.GeneticAlgorithm(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)[source]

Bases: object

A class to optimize the feature selection for XGBoost using a genetic algorithm.

Parameters:
  • X_train (ndarray | DataFrame | Series) –

  • y_train (ndarray | DataFrame | Series) –

  • X_test (ndarray | DataFrame | Series) –

  • y_test (ndarray | DataFrame | Series) –

  • xgboost_params (dict) –

  • X_validation (None | ndarray | DataFrame | Series) –

  • y_validation (None | ndarray | DataFrame | Series) –

  • storage (str) –

  • evolution_params (dict | None) –

  • use_gpu (bool) –

  • early_stopping_rounds (int) –

  • random_state (int) –

  • fixed_features_index (list | None) –

  • verbose (bool) –

__init__(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)[source]

Constructor for the GeneticAlgorithm class.

Parameters:
  • X_train (np.ndarray | pd.DataFrame | pd.Series) – The full training dataset.

  • y_train (np.ndarray | pd.DataFrame | pd.Series) – The training labels.

  • X_test (np.ndarray | pd.DataFrame | pd.Series) – The full test dataset.

  • y_test (np.ndarray | pd.DataFrame | pd.Series) – The test labels.

  • xgboost_params (dict) – The hyperparameters for the XGBoost model.

  • X_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation dataset. Default is None.

  • y_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation labels. Default is None.

  • evolution_params (dict, None, optional) – The hyperparameters for the genetic algorithm. Default is None.

  • use_gpu (bool, optional) – Whether to use the GPU for training the XGBoost model. Default is False.

  • random_state (int, optional) – The random state for the XGBoost model. Default is 42.

  • fixed_features_index (list, None, optional) – The indexes of the scores to be used for the evaluation. Default is None.

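  • storage (str, optional) – The storage URL. Default is “sqlite:///GA.db”.

  • early_stopping_rounds (int, optional) – The number of early stopping rounds. Default is 20.

  • verbose (bool, optional) – Whether to print verbose output. Default is False.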

Return type:

None

crossover(parent1, parent2)[source]

A function to perform crossover for the genetic algorithm.

Parameters:
  • parent1 (np.ndarray) – The first parent.

  • parent2 (np.ndarray) – The second parent.

Returns:

The child individual.

Return type:

np.ndarray
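
The entry above does not say which crossover operator is used. Purely as an illustration, the sketch below assumes single-point crossover on binary feature masks. The name single_point_crossover and its rng argument are hypothetical, and plain Python lists stand in for np.ndarray so the example has no dependencies.

```python
import random


def single_point_crossover(parent1, parent2, rng=random):
    """Hypothetical helper: join two equal-length binary masks at one cut point."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    if len(parent1) < 2:
        # Nothing to cut: return a copy of the first parent.
        return list(parent1)
    # Cut point in [1, len - 1] so each parent contributes at least one gene.
    cut = rng.randint(1, len(parent1) - 1)
    return list(parent1[:cut]) + list(parent2[cut:])


rng = random.Random(0)  # seeded generator for a reproducible demo
child = single_point_crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0], rng)
print(child)  # a run of 1s (from parent1) followed by a run of 0s (from parent2)
```

With these two parents the child always starts with genes from parent1 and ends with genes from parent2, which makes the cut point easy to see.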

fitness(individual)[source]

A function to calculate the fitness of a set of features represented by an individual.

Parameters:

individual (list) – A binary list representing the inclusion (1) or exclusion (0) of each feature.

Returns:

A tuple with the metric score of the selected features and the corresponding model.

Return type:

tuple
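
To make the binary encoding concrete, the sketch below shows only the decoding step: turning an individual into the column indices it selects and slicing a small dataset accordingly. Model training and scoring are omitted. The helpers selected_columns and apply_mask are hypothetical names, not part of the module.

```python
def selected_columns(individual):
    """Hypothetical helper: indices of the features marked 1 in the mask."""
    return [i for i, bit in enumerate(individual) if bit == 1]


def apply_mask(rows, individual):
    """Hypothetical helper: keep only the selected columns of each row."""
    columns = selected_columns(individual)
    return [[row[i] for i in columns] for row in rows]


# Two samples with four features each.
X = [
    [0.1, 0.2, 0.3, 0.4],
    [1.1, 1.2, 1.3, 1.4],
]
individual = [1, 0, 0, 1]  # keep features 0 and 3, drop features 1 and 2

X_selected = apply_mask(X, individual)
print(selected_columns(individual))  # [0, 3]
print(X_selected)  # [[0.1, 0.4], [1.1, 1.4]]
```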

genetic_algorithm(trial_params, trial)[source]

A function to perform the genetic algorithm for feature selection.

Parameters:
  • number_of_generations (int) – The number of generations.

  • population_size (int) – The size of the population.

  • mutation_rate (float) – The mutation rate.

  • trial_params (dict) –

  • trial (Any) –

Returns:

  • np.ndarray – The selected features.

  • XGBRegressor – The model.

  • float – The score of the selected features.

  • Union[None, float] – The AUC score of the selected features. If the validation dataset is not provided, None is returned.

Return type:

tuple[ndarray, XGBRegressor, float, None | float]
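
How the method combines its operators is not described above. The sketch below is one common arrangement, given as an assumption rather than the module's implementation: tournament selection, single-point crossover, bit-flip mutation, and elitism. The functions toy_fitness, select, and run_ga are hypothetical, and the toy fitness (agreement with a known target mask) stands in for training and scoring an XGBoost model.

```python
import random


def toy_fitness(individual, target):
    """Toy stand-in for model scoring: count positions that match the target mask."""
    return sum(1 for gene, wanted in zip(individual, target) if gene == wanted)


def select(population, scores, tournament_size, rng):
    """Return the fittest of tournament_size randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), tournament_size)
    return population[max(contenders, key=lambda i: scores[i])]


def run_ga(target, number_of_generations=30, population_size=20,
           mutation_rate=0.05, tournament_size=3, seed=0):
    """Hypothetical driver; needs len(target) >= 2 and population_size >= tournament_size."""
    rng = random.Random(seed)
    n = len(target)
    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(population_size)]
    best = max(population, key=lambda ind: toy_fitness(ind, target))
    history = [toy_fitness(best, target)]  # best score per generation

    for _ in range(number_of_generations):
        scores = [toy_fitness(ind, target) for ind in population]
        # Elitism (an assumption of this sketch): the best individual survives unchanged.
        next_population = [list(best)]
        while len(next_population) < population_size:
            p1 = select(population, scores, tournament_size, rng)
            p2 = select(population, scores, tournament_size, rng)
            cut = rng.randint(1, n - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=lambda ind: toy_fitness(ind, target))
        history.append(toy_fitness(best, target))

    return best, history


target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
best, history = run_ga(target)
print(history[0], history[-1])  # best score at the start and at the end
```

Because the best individual is copied into every new generation, the best score in history never decreases from one generation to the next.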

initialize_population(number_of_features, population_size)[source]

A function to initialize the population for the genetic algorithm.

Parameters:
  • number_of_features (int) – The number of features in the dataset.

  • population_size (int) – The size of the population.

Returns:

The initialized population.

Return type:

np.ndarray
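
The entry above does not specify how the initial individuals are drawn. A minimal sketch, assuming each bit is sampled uniformly from {0, 1}, could look like this. The name random_population is hypothetical, and a list of lists stands in for the np.ndarray that the method returns.

```python
import random


def random_population(number_of_features, population_size, rng=random):
    """Hypothetical helper: population_size random binary masks of equal length."""
    return [
        [rng.randint(0, 1) for _ in range(number_of_features)]
        for _ in range(population_size)
    ]


rng = random.Random(42)  # seeded generator for a reproducible demo
population = random_population(number_of_features=6, population_size=4, rng=rng)
for individual in population:
    print(individual)  # one row per individual, one bit per feature
```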

mutation(individual, mutation_rate=0.05)[source]

A function to perform mutation for the genetic algorithm.

Parameters:
  • individual (np.ndarray) – The individual to be mutated.

  • mutation_rate (float, optional) – The mutation rate. Default is 0.05.

Returns:

The mutated individual.

Return type:

np.ndarray
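
The mutation operator itself is not described above. The sketch below assumes the usual bit-flip mutation for binary encodings, where every gene is flipped independently with probability mutation_rate. The name bit_flip_mutation and its rng argument are hypothetical.

```python
import random


def bit_flip_mutation(individual, mutation_rate=0.05, rng=random):
    """Hypothetical helper: flip each bit independently with probability mutation_rate."""
    return [
        1 - gene if rng.random() < mutation_rate else gene
        for gene in individual
    ]


rng = random.Random(1)  # seeded generator for a reproducible demo
original = [0, 1, 0, 1, 1, 0, 0, 1]

unchanged = bit_flip_mutation(original, mutation_rate=0.0, rng=rng)
inverted = bit_flip_mutation(original, mutation_rate=1.0, rng=rng)
print(unchanged)  # [0, 1, 0, 1, 1, 0, 0, 1]
print(inverted)   # [1, 0, 1, 0, 0, 1, 1, 0]
```

The two extreme rates show the behaviour at the boundaries: a rate of 0.0 never flips a bit and a rate of 1.0 flips every bit. The default of 0.05 flips one gene in twenty on average.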

objective(trial)[source]

The objective function for the Optuna optimization.

Parameters:

trial (optuna.Trial) – The trial object.

Returns:

The AUC score of the selected features.

Return type:

float

optimize(direction='maximize', n_trials=100, n_jobs=1, study_name='Genetic Algorithm for descriptor optimization', load_if_exists=True, verbose=False)[source]

A function to optimize the feature selection with the genetic algorithm, using Optuna.

Parameters:
  • direction (str, optional) – The direction of the optimization. Default is “maximize”.

  • n_trials (int, optional) – The number of trials. Default is 100.

  • n_jobs (int, optional) – The number of jobs to run in parallel. Default is 1.

  • study_name (str, optional) – The name of the study. Default is “Genetic Algorithm for descriptor optimization”.

  • load_if_exists (bool, optional) – Whether to load the study if it exists. Default is True.

  • verbose (bool, optional) – Whether to print verbose output. Default is False.

Returns:

  • optuna.study.Study – The Optuna study object.

  • dict – The best hyperparameters.

  • float – The best score.

Return type:

tuple[optuna.study.Study, dict, float]

tournament_selection(population, fitnesses, tournament_size=3)[source]

A function to perform tournament selection for the genetic algorithm.

Parameters:
  • population (np.ndarray) – The current population.

  • fitnesses (np.ndarray) – The fitness scores of the population.

  • tournament_size (int, optional) – The size of the tournament. Default is 3.

Returns:

The selected individual.

Return type:

np.ndarray
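
As a rough illustration of tournament selection, the sketch below draws tournament_size individuals at random and returns the one with the highest fitness. Sampling without replacement and treating higher fitness as better are assumptions of this sketch, and tournament_select is a hypothetical name.

```python
import random


def tournament_select(population, fitnesses, tournament_size=3, rng=random):
    """Hypothetical helper: best of tournament_size randomly drawn individuals."""
    if len(population) != len(fitnesses):
        raise ValueError("population and fitnesses must have the same length")
    size = min(tournament_size, len(population))
    # Draw distinct contender indices, then keep the one with the highest fitness.
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


population = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
fitnesses = [0.61, 0.74, 0.58, 0.69]

rng = random.Random(7)  # seeded generator for a reproducible demo
picked = tournament_select(population, fitnesses, tournament_size=3, rng=rng)
# A tournament over the whole population always returns the global best.
champion = tournament_select(population, fitnesses, tournament_size=4, rng=rng)
print(champion)  # [0, 1, 0]
```

A small tournament_size keeps selection pressure low, so weaker individuals still get picked sometimes. A tournament as large as the population always returns the fittest individual.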