pyccea.utils package

Submodules

pyccea.utils.config module

pyccea.utils.config.load_params(conf_path)[source]

Load the parameters of a Cooperative Co-Evolutionary Algorithm (CCEA) from a configuration file.

Parameters:
conf_path: str

Path to the configuration file.

Returns:
conf: dict

Configuration parameters of a CCEA.

pyccea.utils.datasets module

class pyccea.utils.datasets.DataLoader(dataset: str, conf: dict)[source]

Bases: object

Load dataset and preprocess it to train machine learning algorithms.

Attributes:
data: pd.DataFrame

Raw dataset.

X: pd.DataFrame

Raw input data.

y: pd.Series

Raw output data.

X_train: np.ndarray

Train input data.

X_test: np.ndarray

Test input data.

y_train: np.ndarray

Train output data.

y_test: np.ndarray

Test output data.

n_examples: int

Total number of examples.

n_features: int

Number of features in the dataset.

n_classes: int

Number of classes.

classes: np.ndarray

Class identifiers.

train_size: int

Number of examples in the training set.

test_size: int

Number of examples in the test set.

seed: int, default None

Controls the randomness of the data split.

preset: bool, default False

Some studies ship predefined training and test sets. Set this flag to True to use them.

test_ratio: float

Proportion of the dataset to include in the test set. It should be between 0 and 1.

splitter_type: str

Model selection strategy. It can be “k_fold” or “leave_one_out”.

kfolds: int or None

Number of folds in the k-fold cross validation or in the leave-one-out cross validation.

stratified: bool, default False

If True, the folds preserve the percentage of examples of each class. It is only used when the ‘splitter_type’ parameter is set to ‘k_fold’.

normalize: bool, default False

If True, normalize the training and test sets generated by the split method.

Methods

get_ready()

Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.

DATASETS = {
    '11_tumor': {'file': '11_tumor.parquet', 'task': 'classification'},
    '9_tumor': {'file': '9_tumor.parquet', 'task': 'classification'},
    'brain_tumor_1': {'file': 'brain_tumor_1.parquet', 'task': 'classification'},
    'brain_tumor_2': {'file': 'brain_tumor_2.parquet', 'task': 'classification'},
    'cbd': {'file': 'cbd.parquet', 'task': 'classification'},
    'dermatology': {'file': 'dermatology.parquet', 'task': 'classification'},
    'divorce': {'file': 'divorce.parquet', 'task': 'classification'},
    'dlbcl': {'file': 'dlbcl.parquet', 'task': 'classification'},
    'dorothea': {'file': 'dorothea.parquet', 'task': 'classification'},
    'gfe': {'file': 'gfe.parquet', 'task': 'classification'},
    'hapt': {'file': 'hapt.parquet', 'task': 'classification'},
    'har': {'file': 'har.parquet', 'task': 'classification'},
    'isolet5': {'file': 'isolet5.parquet', 'task': 'classification'},
    'itt_f': {'file': 'itt_f.parquet', 'task': 'regression'},
    'itt_m': {'file': 'itt_m.parquet', 'task': 'regression'},
    'leukemia_1': {'file': 'leukemia_1.parquet', 'task': 'classification'},
    'leukemia_2': {'file': 'leukemia_2.parquet', 'task': 'classification'},
    'leukemia_3': {'file': 'leukemia_3.parquet', 'task': 'classification'},
    'libras': {'file': 'libras.parquet', 'task': 'classification'},
    'lsvt': {'file': 'lsvt.parquet', 'task': 'classification'},
    'lungc': {'file': 'lungc.parquet', 'task': 'classification'},
    'madelon_valid': {'file': 'madelon_valid.parquet', 'task': 'classification'},
    'mfd': {'file': 'mfd.parquet', 'task': 'classification'},
    'orh': {'file': 'orh.parquet', 'task': 'classification'},
    'prostate_tumor_1': {'file': 'prostate_tumor_1.parquet', 'task': 'classification'},
    'qsar_toxicity': {'file': 'qsar_oral_toxicity.parquet', 'task': 'classification'},
    'scadi': {'file': 'scadi.parquet', 'task': 'classification'},
    'shd': {'file': 'shd.parquet', 'task': 'classification'},
    'uji_indoor': {'file': 'uji_indoor_loc.parquet', 'task': 'classification'},
    'wdbc': {'file': 'wdbc.parquet', 'task': 'classification'},
}
NORMALIZATION_METHODS = {'min_max': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'standard': <class 'sklearn.preprocessing._data.StandardScaler'>}
PRIMARY_CONF_KEYS = ['general', 'splitter', 'normalization']
SPLITTER_TYPES = ['k_fold', 'leave_one_out']
get_ready() -> None[source]

Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.

toml_file = <_io.TextIOWrapper name='/home/runner/work/PyCCEA/PyCCEA/pyccea/parameters/datasets.toml' mode='r' encoding='utf-8'>
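
To make the roles of test_ratio, seed, normalize, stratified, and kfolds concrete, here is a minimal sketch built directly on scikit-learn. It does not use DataLoader itself. The synthetic data, the choice of MinMaxScaler, fitting the scaler on the training set only, and running the k-fold split over the training set are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a dataset: 40 examples, 5 features, 2 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)

seed, test_ratio, kfolds = 0, 0.25, 5

# Hold-out split controlled by `test_ratio` and `seed`.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, random_state=seed, stratify=y
)

# Normalization (assumption): fit on the training set only, then apply to both.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Stratified k-fold: every fold keeps the class proportions of y_train.
splitter = StratifiedKFold(n_splits=kfolds, shuffle=True, random_state=seed)
folds = list(splitter.split(X_train, y_train))

print(X_train.shape, X_test.shape, len(folds))
```

Fitting the scaler on the training portion alone keeps information from the test set out of the preprocessing step.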

pyccea.utils.mapping module

pyccea.utils.mapping.angle_modulation_function(coeffs: ndarray, n_features: int) -> ndarray[source]

Homomorphous mapping between binary-valued and continuous-valued space used by the Angle Modulated Differential Evolution (AMDE) algorithm.

Parameters:
coeffs: np.ndarray (4,)

The AMDE evolves values for the four coefficients a, b, c, and d. The first coefficient represents the horizontal shift of the function, the second coefficient represents the maximum frequency of the sine function, the third coefficient represents the frequency of the cosine function, and the fourth coefficient represents the vertical shift of the function.

n_features: int

Number of variables in the original space (e.g., features in a subcomponent).

Returns:
binary_solution: np.ndarray (n_features,)

Binary solution in the original space.
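
A minimal sketch of angle modulation follows, using the generating function commonly given in the AMDE literature, g(x) = sin(2π(x − a) · b · cos(2π(x − a) · c)) + d. The sampling points (evenly spaced in [0, 1]) and the rule that a positive value encodes bit 1 are assumptions; the library's own choices may differ. The name `angle_modulation_sketch` is hypothetical.

```python
import numpy as np


def angle_modulation_sketch(coeffs: np.ndarray, n_features: int) -> np.ndarray:
    """Map four real coefficients to a binary vector of length n_features."""
    a, b, c, d = coeffs
    # Assumption: sample the generating function at evenly spaced points.
    x = np.linspace(0.0, 1.0, n_features)
    g = np.sin(2 * np.pi * (x - a) * b * np.cos(2 * np.pi * (x - a) * c)) + d
    # Assumption: a positive function value encodes bit 1, anything else bit 0.
    return (g > 0).astype(int)


bits = angle_modulation_sketch(np.array([0.0, 1.0, 1.0, 0.0]), n_features=8)
print(bits)
```

The appeal of this mapping is that the optimizer searches only four real values, whatever the length of the binary vector.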

pyccea.utils.mapping.shifted_heaviside_function(real_solution: ndarray, shift: float = 0.5) -> ndarray[source]

Convert continuous solutions into binary solutions using a shifted Heaviside step function.

Parameters:
real_solution: np.ndarray (n_features,)

Solution with continuous values.

shift: float, default 0.5

Threshold for coding the solution from continuous-valued space to binary-valued space. Each solution value will be mapped to 1 if it is greater than or equal to shift and 0 otherwise.

Returns:
binary_solution: np.ndarray (n_features,)

Real solution encoded in a binary representation.
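
The thresholding rule described above (1 if the value is greater than or equal to shift, 0 otherwise) can be sketched in one NumPy expression. The function name `shifted_heaviside_sketch` is hypothetical and this is not the library's source.

```python
import numpy as np


def shifted_heaviside_sketch(real_solution: np.ndarray, shift: float = 0.5) -> np.ndarray:
    """Return 1 where the value is >= shift and 0 elsewhere."""
    return (real_solution >= shift).astype(int)


binary = shifted_heaviside_sketch(np.array([0.1, 0.5, 0.49, 0.9]))
print(binary)  # [0 1 0 1]
```

Note that a value exactly equal to shift maps to 1.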

pyccea.utils.metrics module

class pyccea.utils.metrics.ClassificationMetrics(n_classes: int)[source]

Bases: object

Evaluate a machine learning model trained for a classification problem.

Attributes:
values: dict

Values of classification metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute classification metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute classification metrics.

Parameters:
estimator: BaseEstimator

Model that will be evaluated.

X_test: np.ndarray

Input test data.

y_test: np.ndarray

Output test data.

verbose: bool, default False

If True, show evaluation metrics.

metrics = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1_score', 'specificity']
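
Five of the six metrics listed above have direct scikit-learn functions; specificity does not, so it has to be derived from the confusion matrix. The sketch below computes all six on hand-made labels. Macro averaging over classes and the helper name `macro_specificity` are assumptions; the averaging used by ClassificationMetrics is not stated here.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def macro_specificity(y_true, y_pred) -> float:
    """Average one-vs-rest specificity, TN / (TN + FP), over all classes."""
    cm = confusion_matrix(y_true, y_pred)
    fp = cm.sum(axis=0) - np.diag(cm)    # predicted as class k, true class differs
    tn = cm.sum() - cm.sum(axis=1) - fp  # neither true nor predicted class k
    return float(np.mean(tn / (tn + fp)))


y_test = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

values = {
    "accuracy": accuracy_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro"),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "f1_score": f1_score(y_test, y_pred, average="macro"),
    "specificity": macro_specificity(y_test, y_pred),
}
print(values)
```

The specificity helper divides by TN + FP, so it would need a guard if every example belonged to a single class.
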
class pyccea.utils.metrics.RegressionMetrics[source]

Bases: object

Evaluate a machine learning model trained for a regression problem.

Attributes:
values: dict

Values of regression metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute regression metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute regression metrics.

Parameters:
estimator: BaseEstimator

Model that will be evaluated.

X_test: np.ndarray

Input test data.

y_test: np.ndarray

Output test data.

verbose: bool, default False

If True, show evaluation metrics.

metrics = ['mae', 'mse', 'rmse', 'mape']
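
The four regression metrics can be sketched with scikit-learn on a small hand-made example. This illustrates the metric definitions, not the body of RegressionMetrics.compute. Note that scikit-learn reports MAPE as a fraction, not as a percentage.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_pred = np.array([1.0, 4.0, 7.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
values = {
    "mae": mean_absolute_error(y_true, y_pred),              # mean of |error|
    "mse": mse,                                              # mean of error^2
    "rmse": float(np.sqrt(mse)),                             # square root of MSE
    "mape": mean_absolute_percentage_error(y_true, y_pred),  # mean of |error| / |y_true|
}
print(values)
```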

pyccea.utils.models module

class pyccea.utils.models.ClassificationModel(model_type: str)[source]

Bases: object

Load a classification model, adjust its hyperparameters and get the best model.

Attributes:
estimator: sklearn model object

Trained model. If optimize is True, this is the model chosen by the randomized search, i.e., the estimator that gave the best result on the validation data.

hyperparams: dict

Hyperparameters of the model. If optimize is True, these are the best hyperparameters found, which were used to fit the machine learning model.

Methods

train(X_train, y_train[, seed, kfolds, ...])

Build and train a classification model.

models = {'complement_naive_bayes': (<class 'sklearn.naive_bayes.ComplementNB'>, {}), 'gaussian_naive_bayes': (<class 'sklearn.naive_bayes.GaussianNB'>, {}), 'k_nearest_neighbors': (<class 'sklearn.neighbors._classification.KNeighborsClassifier'>, {'n_neighbors': 1}), 'logistic_regression': (<class 'sklearn.linear_model._logistic.LogisticRegression'>, {}), 'multinomial_naive_bayes': (<class 'sklearn.naive_bayes.MultinomialNB'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestClassifier'>, {'n_jobs': -1}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVC'>, {})}
train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a classification model.

Parameters:
X_train: np.ndarray

Train input data.

y_train: np.ndarray

Train output data.

seed: int, default None

Controls the shuffling applied for subsampling the data.

kfolds: int, default 10

Number of folds in the k-fold cross validation.

n_iter: int, default 100

Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.

optimize: bool, default False

If True, optimize the hyperparameters of the model and return the best model.

verbose: bool, default False

If True, show logs.
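
The parameters seed, kfolds, and n_iter line up with the arguments of scikit-learn's RandomizedSearchCV. The sketch below shows how such a search could be wired for a k-nearest-neighbors classifier. The search space is hypothetical, since the distributions the library samples from are not listed in this section, and the synthetic data exists only to make the example runnable.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=120, n_features=8, random_state=0)

seed, kfolds, n_iter = 0, 5, 10

# Hypothetical search space, chosen for illustration only.
param_distributions = {
    "n_neighbors": randint(1, 15),       # integers from 1 to 14
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=n_iter,      # number of sampled hyperparameter settings
    cv=kfolds,          # k-fold cross validation
    random_state=seed,  # makes the sampling reproducible
)
search.fit(X_train, y_train)

estimator = search.best_estimator_  # refit on the full training data
hyperparams = search.best_params_
print(hyperparams)
```

Raising n_iter explores more settings at a roughly proportional cost in runtime, which is the trade-off the parameter description refers to.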

class pyccea.utils.models.RegressionModel(model_type: str)[source]

Bases: object

Load a regression model, adjust its hyperparameters and get the best model.

Attributes:
estimator: sklearn model object

Trained model. If optimize is True, this is the model chosen by the randomized search, i.e., the estimator that gave the best result on the validation data.

hyperparams: dict

Hyperparameters of the model. If optimize is True, these are the best hyperparameters found, which were used to fit the machine learning model.

Methods

train(X_train, y_train[, seed, kfolds, ...])

Build and train a regression model.

models = {'elastic_net': (<class 'sklearn.linear_model._coordinate_descent.ElasticNet'>, {}), 'lasso': (<class 'sklearn.linear_model._coordinate_descent.Lasso'>, {}), 'linear': (<class 'sklearn.linear_model._base.LinearRegression'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestRegressor'>, {}), 'ridge': (<class 'sklearn.linear_model._ridge.Ridge'>, {}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVR'>, {})}
train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a regression model.

Parameters:
X_train: np.ndarray

Train input data.

y_train: np.ndarray

Train output data.

seed: int, default None

Controls the shuffling applied for subsampling the data.

kfolds: int, default 10

Number of folds in the k-fold cross validation.

n_iter: int, default 100

Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.

optimize: bool, default False

If True, optimize the hyperparameters of the model and return the best model.

verbose: bool, default False

If True, show logs.
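
The models attribute maps each model_type string to a pair of estimator class and default keyword arguments. A registry of that shape could be consumed as sketched below. The reduced `MODELS` dictionary and the helper `build_model` are hypothetical and only mirror the layout; they are not the library's code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Hypothetical registry mirroring the (class, default keyword arguments) layout.
MODELS = {
    "linear": (LinearRegression, {}),
    "ridge": (Ridge, {}),
    "lasso": (Lasso, {}),
}


def build_model(model_type: str):
    """Instantiate the estimator registered under `model_type`."""
    if model_type not in MODELS:
        raise ValueError(f"Unknown model type: {model_type!r}. Options: {sorted(MODELS)}")
    model_class, default_kwargs = MODELS[model_type]
    return model_class(**default_kwargs)


X_train, y_train = make_regression(n_samples=50, n_features=4, noise=0.1, random_state=0)
estimator = build_model("ridge").fit(X_train, y_train)
print(type(estimator).__name__)  # Ridge
```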

pyccea.utils.stats module

pyccea.utils.stats.statistical_comparison_between_independent_samples(x: ndarray, y: ndarray, alpha: float = 0.05, alternative: str = 'two-sided') -> Tuple[str, float, float, bool][source]

Perform a statistical comparison between two independent samples.

This function compares two independent samples x and y using a statistical test (e.g., a t-test or a Mann-Whitney U test) to determine whether they differ significantly. The test is two-sided by default but can be configured as one-sided. It returns the name of the test used, the computed p-value, the test statistic, and whether the null hypothesis is rejected at the given significance level alpha.

Parameters:
x: np.ndarray

The first independent sample.

y: np.ndarray

The second independent sample.

alpha: float, optional

The significance level for the hypothesis test (default is 0.05).

alternative: str, optional

Defines the alternative hypothesis. Options are ‘two-sided’, ‘greater’, or ‘less’ (default is ‘two-sided’).

Returns:
test_name: str

The name of the statistical test used (e.g., ‘t-test’, ‘mann-whitney-u’).

p_comparison: float

The computed p-value of the test.

statistic: float

The test statistic from the hypothesis test.

reject_null: bool

True if the null hypothesis is rejected at the given significance level, False otherwise.

Raises:
ValueError

If the input arrays x and y have different lengths or are not one-dimensional.

ValueError

If the alternative argument is not one of ‘two-sided’, ‘greater’, or ‘less’.
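
The section does not say how the function chooses between a t-test and a Mann-Whitney U test. One common approach, assumed here purely for illustration, is to check each sample for normality with the Shapiro-Wilk test and fall back to the non-parametric test when either check fails. The function name `compare_independent_samples_sketch` is hypothetical; the SciPy calls are real.

```python
from typing import Tuple

import numpy as np
from scipy import stats


def compare_independent_samples_sketch(
    x: np.ndarray, y: np.ndarray, alpha: float = 0.05, alternative: str = "two-sided"
) -> Tuple[str, float, float, bool]:
    """Return (test_name, p_value, statistic, reject_null) for two samples."""
    if alternative not in ("two-sided", "greater", "less"):
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'.")
    if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be one-dimensional and of equal length.")
    # Assumption: Shapiro-Wilk on each sample decides which test is applied.
    normal = stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha
    if normal:
        test_name = "t-test"
        result = stats.ttest_ind(x, y, alternative=alternative)
    else:
        test_name = "mann-whitney-u"
        result = stats.mannwhitneyu(x, y, alternative=alternative)
    p_value = float(result.pvalue)
    return test_name, p_value, float(result.statistic), p_value < alpha


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)
y = rng.normal(loc=3.0, scale=1.0, size=30)
test_name, p_value, statistic, reject_null = compare_independent_samples_sketch(x, y)
print(test_name, p_value, reject_null)
```

With alternative set to ‘greater’ or ‘less’, both SciPy tests evaluate the one-sided hypothesis about x relative to y.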