OCDocker.OCScore.Dimensionality.GeneticAlgorithm module
Module to perform the feature selection using the Genetic Algorithm.
It is imported as:
from OCDocker.OCScore.Dimensionality.GeneticAlgorithm import GeneticAlgorithm
- class OCDocker.OCScore.Dimensionality.GeneticAlgorithm.GeneticAlgorithm(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)
Bases: object
A class to optimize the feature selection for XGBoost using a genetic algorithm.
- Parameters:
X_train (ndarray | DataFrame | Series) –
y_train (ndarray | DataFrame | Series) –
X_test (ndarray | DataFrame | Series) –
y_test (ndarray | DataFrame | Series) –
xgboost_params (dict) –
X_validation (None | ndarray | DataFrame | Series) –
y_validation (None | ndarray | DataFrame | Series) –
storage (str) –
evolution_params (dict | None) –
use_gpu (bool) –
early_stopping_rounds (int) –
random_state (int) –
fixed_features_index (list | None) –
verbose (bool) –
- __init__(X_train, y_train, X_test, y_test, xgboost_params, X_validation=None, y_validation=None, storage='sqlite:///GA.db', evolution_params=None, use_gpu=False, early_stopping_rounds=20, random_state=42, fixed_features_index=None, verbose=False)
Constructor for the GeneticAlgorithm class.
- Parameters:
X_train (np.ndarray | pd.DataFrame | pd.Series) – The full training dataset.
y_train (np.ndarray | pd.DataFrame | pd.Series) – The training labels.
X_test (np.ndarray | pd.DataFrame | pd.Series) – The full test dataset.
y_test (np.ndarray | pd.DataFrame | pd.Series) – The test labels.
xgboost_params (dict) – The hyperparameters for the XGBoost model.
X_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation dataset. Default is None.
y_validation (np.ndarray | pd.DataFrame | pd.Series, optional) – The validation labels. Default is None.
evolution_params (dict, None, optional) – The hyperparameters for the genetic algorithm. Default is None.
use_gpu (bool, optional) – Whether to use the GPU for training the XGBoost model. Default is False.
random_state (int, optional) – The random state for the XGBoost model. Default is 42.
fixed_features_index (list, None, optional) – The indexes of the scores to be used for the evaluation. Default is None.
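storage (str, optional) – The storage URL used to persist the optimization study. Default is 'sqlite:///GA.db'.
early_stopping_rounds (int, optional) – The number of early stopping rounds. Default is 20.
verbose (bool, optional) – Whether to print verbose output. Default is False.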
- Return type:
None
- crossover(parent1, parent2)
A function to perform crossover for the genetic algorithm.
- Parameters:
parent1 (np.ndarray) – The first parent.
parent2 (np.ndarray) – The second parent.
- Returns:
The child individual.
- Return type:
np.ndarray
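The reference does not say which crossover operator is used. As an illustration only, the following is a minimal sketch of single-point crossover, one common choice for binary feature masks. The function name `single_point_crossover` and the `rng` argument are hypothetical, and plain lists stand in for the `np.ndarray` inputs documented above.

```python
import random


def single_point_crossover(parent1, parent2, rng=random):
    """Return a child with parent1's genes before a random cut and parent2's after it.

    Assumption: single-point crossover; the documented method may use another operator.
    """
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    if len(parent1) < 2:
        # Nothing to recombine in a one-gene chromosome.
        return list(parent1)
    # Cut strictly inside the chromosome so that both parents contribute.
    cut = rng.randint(1, len(parent1) - 1)
    return list(parent1[:cut]) + list(parent2[cut:])


child = single_point_crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0], random.Random(0))
```

With an all-ones and an all-zeros parent, the child is always a run of ones followed by a run of zeros, which makes the cut point easy to see.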
- fitness(individual)
A function to calculate the fitness of a set of features represented by an individual.
- Parameters:
individual (list) – A binary list representing the inclusion (1) or exclusion (0) of each feature.
- Returns:
The metric score of the selected features and the model.
- Return type:
tuple
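To make the binary-mask idea concrete, here is a hedged sketch of how an individual can select feature columns and be scored. It is not the class's implementation: `select_features` and `toy_fitness` are hypothetical names, and a trivial mean predictor stands in for the XGBoost model so the example runs with the standard library alone.

```python
def select_features(rows, individual):
    """Keep only the columns whose mask bit is 1."""
    keep = [i for i, bit in enumerate(individual) if bit == 1]
    return [[row[i] for i in keep] for row in rows]


def toy_fitness(individual, rows, targets):
    """Return (score, stand-in model output) for a mask; higher scores are better.

    Assumption: negative mean squared error as the metric, and a mean-of-columns
    predictor in place of a trained XGBoost model.
    """
    selected = select_features(rows, individual)
    if not selected or not selected[0]:
        # A mask that selects no feature cannot be evaluated.
        return float("-inf"), None
    predictions = [sum(row) / len(row) for row in selected]
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    return -mse, predictions


rows = [[1.0, 10.0, 3.0], [2.0, 20.0, 4.0]]
targets = [2.0, 3.0]
score, _ = toy_fitness([1, 0, 1], rows, targets)
```

The tuple return mirrors the documented return type: a metric score paired with the fitted model (here, only its predictions).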
- genetic_algorithm(trial_params, trial)
A function to perform the genetic algorithm for feature selection.
- Parameters:
number_of_generations (int) – The number of generations.
population_size (int) – The size of the population.
mutation_rate (float) – The mutation rate.
trial_params (dict) –
trial (Any) –
- Returns:
np.ndarray – The selected features.
XGBRegressor – The model.
float – The score of the selected features.
Union[None, float] – The AUC score of the selected features. If the validation dataset is not provided, None is returned.
- Return type:
tuple[ndarray, XGBRegressor, float, None | float]
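The overall loop can be sketched as follows, assuming a conventional generational scheme: evaluate, select by tournament, recombine, mutate, and keep track of the best mask seen. This is a toy illustration under those assumptions, not the method's actual code; `run_toy_ga` and the bit-matching fitness are hypothetical, and only the parameter names `number_of_generations`, `population_size` and `mutation_rate` come from the reference above.

```python
import random


def run_toy_ga(fitness, number_of_features, number_of_generations=30,
               population_size=20, mutation_rate=0.05, seed=0):
    """Evolve binary masks and return (best mask, best score).

    Assumes population_size >= 3 and number_of_features >= 2.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(number_of_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(number_of_generations):
        scores = [fitness(ind) for ind in population]

        def pick():
            # Tournament of three: the fittest of three random individuals wins.
            contenders = rng.sample(range(population_size), 3)
            return population[max(contenders, key=lambda i: scores[i])]

        children = []
        while len(children) < population_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, number_of_features - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best, fitness(best)


# Toy objective: count the positions that agree with a hidden target mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]
matches = lambda ind: sum(a == b for a, b in zip(ind, target))
best_mask, best_score = run_toy_ga(matches, number_of_features=len(target))
```

In the documented method the fitness would come from training an XGBoost model on the selected features, which is far more expensive than this toy objective.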
- initialize_population(number_of_features, population_size)
A function to initialize the population for the genetic algorithm.
- Parameters:
number_of_features (int) – The number of features in the dataset.
population_size (int) – The size of the population.
- Returns:
The initialized population.
- Return type:
np.ndarray
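As an illustration of the shape of the result, a population can be viewed as `population_size` rows of `number_of_features` bits. The sketch below assumes each bit is drawn uniformly at random, which the reference does not state; `random_population` and `rng` are hypothetical names, and nested lists stand in for the documented `np.ndarray`.

```python
import random


def random_population(number_of_features, population_size, rng=random):
    """Build population_size binary masks, each with number_of_features bits.

    Assumption: every bit is an independent fair coin flip.
    """
    return [[rng.randint(0, 1) for _ in range(number_of_features)]
            for _ in range(population_size)]


population = random_population(number_of_features=6, population_size=4,
                               rng=random.Random(42))
```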
- mutation(individual, mutation_rate=0.05)
A function to perform mutation for the genetic algorithm.
- Parameters:
individual (np.ndarray) – The individual to be mutated.
mutation_rate (float, optional) – The mutation rate. Default is 0.05.
- Returns:
The mutated individual.
- Return type:
np.ndarray
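For binary masks, mutation is commonly implemented as an independent bit flip per gene. The following sketch assumes that operator, which the reference does not confirm; `bit_flip_mutation` and `rng` are hypothetical names.

```python
import random


def bit_flip_mutation(individual, mutation_rate=0.05, rng=random):
    """Return a copy in which each gene is flipped with probability mutation_rate."""
    return [1 - gene if rng.random() < mutation_rate else gene
            for gene in individual]


original = [1, 0, 1, 1, 0]
unchanged = bit_flip_mutation(original, mutation_rate=0.0)
inverted = bit_flip_mutation(original, mutation_rate=1.0)
```

The two extreme rates bracket the behaviour: a rate of 0.0 leaves the mask as it is, and a rate of 1.0 flips every bit.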
- objective(trial)
The objective function for the Optuna optimization.
- Parameters:
trial (optuna.Trial) – The trial object.
- Returns:
The AUC score of the selected features.
- Return type:
float
- optimize(direction='maximize', n_trials=100, n_jobs=1, study_name='Genetic Algorithm for descriptor optimization', load_if_exists=True, verbose=False)
A function to optimize the feature selection with the genetic algorithm, using Optuna to run the trials.
- Parameters:
direction (str, optional) – The direction of the optimization. Default is “maximize”.
n_trials (int, optional) – The number of trials. Default is 100.
n_jobs (int, optional) – The number of jobs to run in parallel. Default is 1.
study_name (str, optional) – The name of the study. Default is “Genetic Algorithm for descriptor optimization”.
load_if_exists (bool, optional) – Whether to load the study if it exists. Default is True.
verbose (bool, optional) – Whether to print verbose output. Default is False.
- Returns:
optuna.study.Study – The Optuna study object.
dict – The best hyperparameters.
float – The best score.
- Return type:
tuple[optuna.study.Study, dict, float]
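A possible end-to-end call, based only on the signatures documented on this page, might look like the sketch below. The wrapper `run_feature_selection` is a hypothetical helper, the contents of `xgboost_params` are left to the caller, and the import is placed inside the function so the snippet can be loaded without OCDocker installed.

```python
def run_feature_selection(X_train, y_train, X_test, y_test, xgboost_params,
                          n_trials=100):
    """Build a GeneticAlgorithm and run optimize() with the documented arguments."""
    # Imported lazily: requires OCDocker (and its dependencies) at call time.
    from OCDocker.OCScore.Dimensionality.GeneticAlgorithm import GeneticAlgorithm

    ga = GeneticAlgorithm(
        X_train, y_train, X_test, y_test, xgboost_params,
        storage="sqlite:///GA.db",  # documented default
        random_state=42,            # documented default
    )
    # optimize() is documented to return (study, best hyperparameters, best score).
    study, best_params, best_score = ga.optimize(
        direction="maximize",
        n_trials=n_trials,
        load_if_exists=True,
    )
    return study, best_params, best_score
```

Because `load_if_exists` defaults to True, a second run with the same `study_name` and `storage` would be expected to resume the existing study rather than start over.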
- tournament_selection(population, fitnesses, tournament_size=3)
A function to perform tournament selection for the genetic algorithm.
- Parameters:
population (np.ndarray) – The current population.
fitnesses (np.ndarray) – The fitness scores of the population.
tournament_size (int, optional) – The size of the tournament. Default is 3.
- Returns:
The selected individual.
- Return type:
np.ndarray
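Tournament selection can be sketched as: draw `tournament_size` individuals at random and return the fittest of them. The version below assumes sampling without replacement and that a higher fitness is better; neither detail is stated in the reference. `tournament_select` and `rng` are hypothetical names.

```python
import random


def tournament_select(population, fitnesses, tournament_size=3, rng=random):
    """Return the fittest individual among tournament_size random contenders."""
    # Never ask for more contenders than the population holds.
    size = min(tournament_size, len(population))
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


population = [[0, 0], [1, 0], [1, 1]]
fitnesses = [0.1, 0.9, 0.5]
# With a tournament as large as the population, the global best always wins.
selected = tournament_select(population, fitnesses, tournament_size=3)
```

Smaller tournaments lower the selection pressure, since weaker individuals win whenever the best ones are not drawn.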