pyccea.utils package

Submodules

pyccea.utils.config module

pyccea.utils.config.load_params(conf_path)[source]

Load parameters of a Cooperative Co-Evolutionary Algorithm (CCEA) from a configuration file.

Parameters:

conf_path: str: Path to the configuration file.

Returns:

conf: dict: Configuration parameters of a CCEA.

pyccea.utils.datasets module

class pyccea.utils.datasets.DataLoader(dataset: str, conf: dict)[source]

Bases: object

Load dataset and preprocess it to train machine learning algorithms.

Attributes:

data: pd.DataFrame: Raw dataset.
X: pd.DataFrame: Raw input data.
y: pd.Series: Raw output data.
X_train: np.ndarray: Train input data.
X_test: np.ndarray: Test input data.
y_train: np.ndarray: Train output data.
y_test: np.ndarray: Test output data.
n_examples: int: Total number of examples.
n_features: int: Number of features in the dataset.
n_classes: int: Number of classes.
classes: np.ndarray: Class identifiers.
train_size: int: Number of examples in the training set.
test_size: int: Number of examples in the test set.
seed: int, default None: It controls the randomness of the data split.
preset: bool, default False: In some works, the training and testing sets have already been defined. To use them, just set this boolean variable to True.
test_ratio: float: Proportion of the dataset to include in the test set. It should be between 0 and 1.
splitter_type: str: Model selection strategy. It can be “k_fold” or “leave_one_out”.
kfolds: int or None: Number of folds in the k-fold cross validation or in the leave-one-out cross validation.
stratified: bool, default False: If True, the folds are made by preserving the percentage of examples for each class. It is only used in case ‘splitter_type’ parameter is set to ‘k_fold’.
normalize: bool, default False: If True, normalizes training and test sets generated by the split method.
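
The two splitter types named above can be illustrated with a small, dependency-free sketch. It is not the DataLoader implementation: the helper names `k_fold_indices` and `leave_one_out_indices` are hypothetical, stratification is not handled, and plain lists stand in for arrays. It only shows how `kfolds` and `seed` could drive the split, and that leave-one-out is the special case with one fold per example.

```python
import random


def k_fold_indices(n_examples, kfolds, seed=None):
    """Return a list of (train_indices, test_indices) pairs (hypothetical helper)."""
    indices = list(range(n_examples))
    # The seed makes the shuffle, and therefore the folds, reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into `kfolds` groups of near-equal size.
    folds = [indices[i::kfolds] for i in range(kfolds)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


def leave_one_out_indices(n_examples):
    """Leave-one-out is k-fold with as many folds as examples."""
    return k_fold_indices(n_examples, kfolds=n_examples)
```

Each example appears in exactly one test fold, so the test folds together cover the whole dataset.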

Methods

get_ready()

Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.

DATASETS = {'11_tumor': {'file': '11_tumor.parquet', 'task': 'classification'}, '9_tumor': {'file': '9_tumor.parquet', 'task': 'classification'}, 'brain_tumor_1': {'file': 'brain_tumor_1.parquet', 'task': 'classification'}, 'brain_tumor_2': {'file': 'brain_tumor_2.parquet', 'task': 'classification'}, 'cbd': {'file': 'cbd.parquet', 'task': 'classification'}, 'dermatology': {'file': 'dermatology.parquet', 'task': 'classification'}, 'divorce': {'file': 'divorce.parquet', 'task': 'classification'}, 'dlbcl': {'file': 'dlbcl.parquet', 'task': 'classification'}, 'dorothea': {'file': 'dorothea.parquet', 'task': 'classification'}, 'gfe': {'file': 'gfe.parquet', 'task': 'classification'}, 'hapt': {'file': 'hapt.parquet', 'task': 'classification'}, 'har': {'file': 'har.parquet', 'task': 'classification'}, 'isolet5': {'file': 'isolet5.parquet', 'task': 'classification'}, 'itt_f': {'file': 'itt_f.parquet', 'task': 'regression'}, 'itt_m': {'file': 'itt_m.parquet', 'task': 'regression'}, 'leukemia_1': {'file': 'leukemia_1.parquet', 'task': 'classification'}, 'leukemia_2': {'file': 'leukemia_2.parquet', 'task': 'classification'}, 'leukemia_3': {'file': 'leukemia_3.parquet', 'task': 'classification'}, 'libras': {'file': 'libras.parquet', 'task': 'classification'}, 'lsvt': {'file': 'lsvt.parquet', 'task': 'classification'}, 'lungc': {'file': 'lungc.parquet', 'task': 'classification'}, 'madelon_valid': {'file': 'madelon_valid.parquet', 'task': 'classification'}, 'mfd': {'file': 'mfd.parquet', 'task': 'classification'}, 'orh': {'file': 'orh.parquet', 'task': 'classification'}, 'prostate_tumor_1': {'file': 'prostate_tumor_1.parquet', 'task': 'classification'}, 'qsar_toxicity': {'file': 'qsar_oral_toxicity.parquet', 'task': 'classification'}, 'scadi': {'file': 'scadi.parquet', 'task': 'classification'}, 'shd': {'file': 'shd.parquet', 'task': 'classification'}, 'uji_indoor': {'file': 'uji_indoor_loc.parquet', 'task': 'classification'}, 'wdbc': {'file': 'wdbc.parquet', 'task': 'classification'}}

NORMALIZATION_METHODS = {'min_max': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'standard': <class 'sklearn.preprocessing._data.StandardScaler'>}

PRIMARY_CONF_KEYS = ['general', 'splitter', 'normalization']

SPLITTER_TYPES = ['k_fold', 'leave_one_out']
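
The class constants above suggest how a `conf` dictionary might be checked before use. The following is a hedged sketch only: the function `validate_conf_sketch` is hypothetical, and the assumption that the splitter type lives under `conf["splitter"]["splitter_type"]` is a guess based on the attribute name, not something this reference confirms.

```python
PRIMARY_CONF_KEYS = ["general", "splitter", "normalization"]
SPLITTER_TYPES = ["k_fold", "leave_one_out"]


def validate_conf_sketch(conf):
    """Check a configuration dictionary against the documented constants."""
    # All primary sections are expected to be present.
    missing = [key for key in PRIMARY_CONF_KEYS if key not in conf]
    if missing:
        raise KeyError(f"Missing configuration sections: {missing}")
    # Assumed key layout: the splitter section carries a 'splitter_type' entry.
    splitter_type = conf["splitter"].get("splitter_type")
    if splitter_type not in SPLITTER_TYPES:
        raise ValueError(
            f"Unknown splitter type {splitter_type!r}; expected one of {SPLITTER_TYPES}"
        )
    return True
```

Failing early on a missing section or an unknown splitter type keeps configuration mistakes from surfacing later as less readable errors during the data split.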

get_ready() → None[source]: Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.

toml_file = <_io.TextIOWrapper name='/home/runner/work/PyCCEA/PyCCEA/pyccea/parameters/datasets.toml' mode='r' encoding='utf-8'>

pyccea.utils.mapping module

pyccea.utils.mapping.angle_modulation_function(coeffs: ndarray, n_features: int) → ndarray[source]

Homomorphous mapping between binary-valued and continuous-valued space used by the Angle Modulated Differential Evolution (AMDE) algorithm.

Parameters:

coeffs: np.ndarray (4,): The AMDE evolves values for the four coefficients a, b, c, and d. The first coefficient represents the horizontal shift of the function, the second coefficient represents the maximum frequency of the sine function, the third coefficient represents the frequency of the cosine function, and the fourth coefficient represents the vertical shift of the function.
n_features: int: Number of variables in the original space (e.g., features in a subcomponent).

Returns:

binary_solution: np.ndarray (n_features,): Binary solution in the original space.
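
As an illustration of the mapping described above, here is a minimal sketch of the angle modulation generating function commonly given in the AMDE literature, g(x) = sin(2π(x − a) · b · cos(2π(x − a) · c)) + d, with a bit set to 1 where g(x) > 0. The function name `angle_modulation_sketch`, the use of plain lists, and the choice of sample points x = 0, 1, ..., n_features − 1 are assumptions; the library may sample the function differently.

```python
import math


def angle_modulation_sketch(coeffs, n_features):
    """Map four continuous coefficients to a binary list of length n_features."""
    a, b, c, d = coeffs
    bits = []
    # Assumption: the function is sampled at evenly spaced integer points.
    for x in range(n_features):
        shifted = 2 * math.pi * (x - a)  # a: horizontal shift
        # b scales the sine argument, c is the cosine frequency, d: vertical shift
        g = math.sin(shifted * b * math.cos(shifted * c)) + d
        bits.append(1 if g > 0 else 0)
    return bits
```

Because the sine term is bounded by 1 in magnitude, a vertical shift above 1 selects every feature and one below −1 selects none, which gives a quick sanity check for any implementation.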

pyccea.utils.mapping.shifted_heaviside_function(real_solution: ndarray, shift: float = 0.5) → ndarray[source]

Convert continuous solutions into binary solutions through the Heaviside step function.

Parameters:

real_solution: np.ndarray (n_features,): Solution with continuous values.
shift: float, default 0.5: Threshold for coding the solution from continuous-valued space to binary-valued space. Each solution value will be mapped to 1 if it is greater than or equal to shift and 0 otherwise.

Returns:

binary_solution: np.ndarray (n_features,): Real solution encoded in a binary representation.
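
The thresholding rule stated for shift can be sketched in a few lines. This is a dependency-free illustration using plain lists rather than NumPy arrays, and the name `shifted_heaviside_sketch` is hypothetical.

```python
def shifted_heaviside_sketch(real_solution, shift=0.5):
    """Encode each continuous value as 1 if it is >= shift, else 0."""
    return [1 if value >= shift else 0 for value in real_solution]


# A value exactly equal to the shift is mapped to 1.
print(shifted_heaviside_sketch([0.2, 0.5, 0.9]))  # [0, 1, 1]
```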

pyccea.utils.metrics module

class pyccea.utils.metrics.ClassificationMetrics(n_classes: int)[source]

Bases: object

Evaluate machine learning model trained for a classification problem.

Attributes:

values: dict: Values of classification metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute classification metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute classification metrics.

Parameters:

estimator: BaseEstimator: Model that will be evaluated.
X_test: np.ndarray: Input test data.
y_test: np.ndarray: Output test data.
verbose: bool, default False: If True, show evaluation metrics.

metrics = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1_score', 'specificity']
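
To make the metric names above concrete, the following sketch computes them for the binary case from confusion-matrix counts. It is an assumption-laden illustration: the class itself takes n_classes, and how it averages per-class values for multi-class problems is not stated here. The function name is hypothetical, labels are assumed to be 0 and 1, and both classes and at least one positive prediction are assumed to be present so that no denominator is zero.

```python
def binary_classification_metrics_sketch(y_true, y_pred):
    """Compute the listed metrics for labels in {0, 1}, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Balanced accuracy in the binary case is the mean of the two rates.
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
    }
```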

class pyccea.utils.metrics.RegressionMetrics[source]

Bases: object

Evaluate machine learning model trained for a regression problem.

Attributes:

values: dict: Values of regression metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute regression metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute regression metrics.

Parameters:

estimator: BaseEstimator: Model that will be evaluated.
X_test: np.ndarray: Input test data.
y_test: np.ndarray: Output test data.
verbose: bool, default False: If True, show evaluation metrics.

metrics = ['mae', 'mse', 'rmse', 'mape']
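
The four metric names above have standard definitions, sketched here with plain lists. The function name is hypothetical, and returning MAPE as a fraction rather than a percentage is an assumption about convention, not a statement about what the class reports.

```python
import math


def regression_metrics_sketch(y_true, y_pred):
    """Compute MAE, MSE, RMSE and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined when a true value is zero; this sketch assumes none are.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape}
```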

pyccea.utils.models module

class pyccea.utils.models.ClassificationModel(model_type: str)[source]

Bases: object

Load a classification model, adjust its hyperparameters and get the best model.

Attributes:

estimator: sklearn model object: Trained model. In case optimize is True, it is the model that was chosen by the Randomized Search, i.e., estimator which gave the best result on the validation data.
hyperparams: dict: Hyperparameters of the model. In case optimize is True, it is the best hyperparameters used to fit the machine learning model.

Methods

train(X_train, y_train[, seed, kfolds, ...])

Build and train a classification model.

models = {'complement_naive_bayes': (<class 'sklearn.naive_bayes.ComplementNB'>, {}), 'gaussian_naive_bayes': (<class 'sklearn.naive_bayes.GaussianNB'>, {}), 'k_nearest_neighbors': (<class 'sklearn.neighbors._classification.KNeighborsClassifier'>, {'n_neighbors': 1}), 'logistic_regression': (<class 'sklearn.linear_model._logistic.LogisticRegression'>, {}), 'multinomial_naive_bayes': (<class 'sklearn.naive_bayes.MultinomialNB'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestClassifier'>, {'n_jobs': -1}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVC'>, {})}

train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a classification model.

Parameters:

X_train: np.ndarray: Train input data.
y_train: np.ndarray: Train output data.
seed: int, default None: Controls the shuffling applied for subsampling the data.
kfolds: int, default 10: Number of folds in the k-fold cross validation.
n_iter: int, default 100: Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.
optimize: bool, default False: If True, optimize the hyperparameters of the model and return the best model.
verbose: bool, default False: If True, show logs.
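
The roles of n_iter and seed in a randomized search can be shown with a generic, dependency-free sketch. It is not the library's training routine: `randomized_search_sketch` and its `score_fn` callback are hypothetical, and in practice the score would come from k-fold cross-validation of an estimator rather than a plain function.

```python
import random


def randomized_search_sketch(param_space, score_fn, n_iter=100, seed=None):
    """Sample n_iter hyperparameter settings and keep the best-scoring one."""
    rng = random.Random(seed)  # the seed makes the sampled settings reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one candidate value per hyperparameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A larger n_iter explores more settings and tends to find better ones, at the cost of one model evaluation per sampled setting.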

class pyccea.utils.models.RegressionModel(model_type: str)[source]

Bases: object

Load a regression model, adjust its hyperparameters and get the best model.

Attributes:

estimator: sklearn model object: Trained model. In case optimize is True, it is the model that was chosen by the Randomized Search, i.e., estimator which gave the best result on the validation data.
hyperparams: dict: Hyperparameters of the model. In case optimize is True, it is the best hyperparameters used to fit the machine learning model.

Methods

train(X_train, y_train[, seed, kfolds, ...])

Build and train a regression model.

models = {'elastic_net': (<class 'sklearn.linear_model._coordinate_descent.ElasticNet'>, {}), 'lasso': (<class 'sklearn.linear_model._coordinate_descent.Lasso'>, {}), 'linear': (<class 'sklearn.linear_model._base.LinearRegression'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestRegressor'>, {}), 'ridge': (<class 'sklearn.linear_model._ridge.Ridge'>, {}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVR'>, {})}

train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a regression model.

Parameters:

X_train: np.ndarray: Train input data.
y_train: np.ndarray: Train output data.
seed: int, default None: Controls the shuffling applied for subsampling the data.
kfolds: int, default 10: Number of folds in the k-fold cross validation.
n_iter: int, default 100: Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.
optimize: bool, default False: If True, optimize the hyperparameters of the model and return the best model.
verbose: bool, default False: If True, show logs.

pyccea.utils.stats module

pyccea.utils.stats.statistical_comparison_between_independent_samples(x: ndarray, y: ndarray, alpha: float = 0.05, alternative: str = 'two-sided') → Tuple[str, float, float, bool][source]

Perform a statistical comparison between two independent samples.

This function compares two independent samples x and y using a statistical test (e.g., t-test) to determine if there is a significant difference between them. The test is two-sided by default, but can be configured to perform one-sided tests. It returns the name of the test used, the computed p-value, the test statistic, and whether the null hypothesis is rejected at the given significance level alpha.

Parameters:

x: np.ndarray: The first independent sample.
y: np.ndarray: The second independent sample.
alpha: float, optional: The significance level for the hypothesis test (default is 0.05).
alternative: str, optional: Defines the alternative hypothesis. Options are ‘two-sided’, ‘greater’, or ‘less’ (default is ‘two-sided’).

Returns:

test_name: str: The name of the statistical test used (e.g., ‘t-test’, ‘mann-whitney-u’).
p_comparison: float: The computed p-value of the test.
statistic: float: The test statistic from the hypothesis test.
reject_null: bool: True if the null hypothesis is rejected at the given significance level, False otherwise.

Raises:

ValueError: If the input arrays x and y have different lengths or are not one-dimensional.
ValueError: If the alternative argument is not one of ‘two-sided’, ‘greater’, or ‘less’.
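
The shape of the return value and the effect of alternative can be illustrated with a self-contained sketch of one of the tests named above, the Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, so its p-values are approximate. It says nothing about how the library chooses between tests, and it validates only the alternative argument. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def mann_whitney_u_sketch(x, y, alpha=0.05, alternative="two-sided"):
    """Return (test_name, p_value, statistic, reject_null) for two samples."""
    if alternative not in ("two-sided", "greater", "less"):
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties count as one half.
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / std_u
    cdf = NormalDist().cdf(z)
    if alternative == "greater":    # x tends to be larger than y
        p_value = 1 - cdf
    elif alternative == "less":     # x tends to be smaller than y
        p_value = cdf
    else:                           # difference in either direction
        p_value = min(1.0, 2 * min(cdf, 1 - cdf))
    return "mann-whitney-u", p_value, u, p_value < alpha
```

The null hypothesis is rejected only when the p-value falls below alpha, so a one-sided test in the wrong direction does not reject even for clearly separated samples.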