pyccea.utils package

Submodules

pyccea.utils.config module

pyccea.utils.config.load_params(conf_path)[source]

Load parameters of a Cooperative Co-Evolutionary Algorithm (CCEA) from a configuration file.

Parameters:
conf_path: str

Path to the configuration file.

Returns:
conf: dict

Configuration parameters of a CCEA.
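
A minimal usage sketch (the configuration path below is hypothetical; pyccea ships TOML parameter files, e.g. pyccea/parameters/datasets.toml):

    from pyccea.utils.config import load_params

    # Hypothetical path to a CCEA configuration file.
    conf = load_params("parameters/my_ccea.toml")
    print(type(conf))  # dict of configuration parameters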

pyccea.utils.datasets module

class pyccea.utils.datasets.DataLoader(dataset: str, conf: dict)[source]

Bases: object

Load a dataset and preprocess it to train machine learning algorithms.

Attributes:
data: pd.DataFrame

Raw dataset.

X: pd.DataFrame

Raw input data.

y: pd.Series

Raw output data.

X_train: np.ndarray

Train input data.

X_test: np.ndarray

Test input data.

y_train: np.ndarray

Train output data.

y_test: np.ndarray

Test output data.

n_examples: int

Total number of examples.

n_features: int

Number of features in the dataset.

n_classes: int

Number of classes.

classes: np.ndarray

Class identifiers.

train_size: int

Number of examples in the training set.

test_size: int

Number of examples in the test set.

seed: int, default None

Controls the randomness of the data split.

preset: bool, default False

In some works, the training and test sets are already defined. To use them, set this flag to True.

test_ratio: float

Proportion of the dataset to include in the test set. It should be between 0 and 1.

splitter_type: str

Model selection strategy. It can be “k_fold” or “leave_one_out”.

kfolds: int or None

Number of folds in the k-fold cross validation or in the leave-one-out cross validation.

stratified: bool, default False

If True, the folds are made by preserving the percentage of examples for each class. Only used when ‘splitter_type’ is set to ‘k_fold’.

normalize: bool, default False

If True, normalizes the training and test sets generated by the split method.

Methods

get_fold(k[, normalize])

Get k-th train and validation folds.

get_ready()

Prepare data for a machine learning algorithm to perform feature selection.

DATASETS = {'11_tumor': {'file': '11_tumor.parquet', 'task': 'classification'}, '9_tumor': {'file': '9_tumor.parquet', 'task': 'classification'}, 'brain_tumor_1': {'file': 'brain_tumor_1.parquet', 'task': 'classification'}, 'brain_tumor_2': {'file': 'brain_tumor_2.parquet', 'task': 'classification'}, 'cbd': {'file': 'cbd.parquet', 'task': 'classification'}, 'dermatology': {'file': 'dermatology.parquet', 'task': 'classification'}, 'divorce': {'file': 'divorce.parquet', 'task': 'classification'}, 'dlbcl': {'file': 'dlbcl.parquet', 'task': 'classification'}, 'dorothea': {'file': 'dorothea.parquet', 'task': 'classification'}, 'gfe': {'file': 'gfe.parquet', 'task': 'classification'}, 'hapt': {'file': 'hapt.parquet', 'task': 'classification'}, 'har': {'file': 'har.parquet', 'task': 'classification'}, 'isolet5': {'file': 'isolet5.parquet', 'task': 'classification'}, 'itt_f': {'file': 'itt_f.parquet', 'task': 'regression'}, 'itt_m': {'file': 'itt_m.parquet', 'task': 'regression'}, 'leukemia_1': {'file': 'leukemia_1.parquet', 'task': 'classification'}, 'leukemia_2': {'file': 'leukemia_2.parquet', 'task': 'classification'}, 'leukemia_3': {'file': 'leukemia_3.parquet', 'task': 'classification'}, 'libras': {'file': 'libras.parquet', 'task': 'classification'}, 'linear_synthetic': {'file': 'linear_synthetic_sample.parquet', 'task': 'classification'}, 'lsvt': {'file': 'lsvt.parquet', 'task': 'classification'}, 'lungc': {'file': 'lungc.parquet', 'task': 'classification'}, 'madelon_valid': {'file': 'madelon_valid.parquet', 'task': 'classification'}, 'mfd': {'file': 'mfd.parquet', 'task': 'classification'}, 'nonlinear_synthetic': {'file': 'nonlinear_synthetic_sample.parquet', 'task': 'classification'}, 'orh': {'file': 'orh.parquet', 'task': 'classification'}, 'prostate_tumor_1': {'file': 'prostate_tumor_1.parquet', 'task': 'classification'}, 'qsar_toxicity': {'file': 'qsar_oral_toxicity.parquet', 'task': 'classification'}, 'scadi': {'file': 'scadi.parquet', 'task': 'classification'}, 'shd': {'file': 'shd.parquet', 'task': 'classification'}, 'uji_indoor': {'file': 'uji_indoor_loc.parquet', 'task': 'classification'}, 'wdbc': {'file': 'wdbc.parquet', 'task': 'classification'}}
NORMALIZATION_METHODS = {'min_max': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'standard': <class 'sklearn.preprocessing._data.StandardScaler'>}
PRIMARY_CONF_KEYS = ['general', 'splitter', 'preprocessing', 'normalization']
SPLITTER_TYPES = ['k_fold', 'leave_one_out']
get_fold(k: int, normalize: bool | None = None) Tuple[ndarray, ndarray, ndarray, ndarray][source]

Get k-th train and validation folds.

Parameters:
k: int

Fold index.

normalize: bool, default None

If True, normalizes the training and validation sets generated by the split method. If None, uses the value of the ‘normalize’ attribute.

Returns:
X_fold_train: np.ndarray

k-th fold training input data.

y_fold_train: np.ndarray

k-th fold training output data.

X_fold_val: np.ndarray

k-th fold validation input data.

y_fold_val: np.ndarray

k-th fold validation output data.

get_ready() None[source]

Prepare data for a machine learning algorithm to perform feature selection.
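
Putting the pieces together, a typical workflow might look like the following sketch. The dataset key “wdbc” appears in the DATASETS table above; the configuration path and the assumption that fold indices start at 0 are illustrative:

    from pyccea.utils.config import load_params
    from pyccea.utils.datasets import DataLoader

    conf = load_params("parameters/my_dataset.toml")  # hypothetical path

    loader = DataLoader(dataset="wdbc", conf=conf)
    loader.get_ready()  # split and preprocess; fills X_train, y_train, ...

    # Fetch one train/validation fold produced by the configured splitter.
    X_tr, y_tr, X_val, y_val = loader.get_fold(k=0, normalize=True)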

toml_file = <_io.TextIOWrapper name='/home/runner/work/PyCCEA/PyCCEA/pyccea/parameters/datasets.toml' encoding='utf-8'>

pyccea.utils.mapping module

pyccea.utils.mapping.angle_modulation_function(coeffs: ndarray, n_features: int) ndarray[source]

Homomorphous mapping between binary-valued and continuous-valued space used by the Angle Modulated Differential Evolution (AMDE) algorithm.

Parameters:
coeffs: np.ndarray (4,)

The AMDE evolves values for the four coefficients a, b, c, and d: a controls the horizontal shift of the generating function, b the maximum frequency of the sine term, c the frequency of the cosine term, and d the vertical shift.

n_features: int

Number of variables in the original space (e.g., features in a subcomponent).

Returns:
binary_solution: np.ndarray (n_features,)

Binary solution in the original space.
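
In the AMDE literature the generating function is typically g(x) = sin(2π(x − a) · b · cos(2π(x − a) · c)) + d, sampled at evenly spaced points and thresholded at zero. A sketch under that assumption (pyccea’s exact sampling scheme may differ):

    import numpy as np

    def angle_modulation_sketch(coeffs: np.ndarray, n_features: int) -> np.ndarray:
        # Illustrative AMDE mapping, not necessarily pyccea's exact implementation.
        a, b, c, d = coeffs
        x = np.arange(n_features)  # assumed sampling points, one per feature
        g = np.sin(2 * np.pi * (x - a) * b * np.cos(2 * np.pi * (x - a) * c)) + d
        return (g > 0).astype(int)  # bit i is 1 where the wave is positive

    bits = angle_modulation_sketch(np.array([0.0, 0.5, 0.8, 0.0]), n_features=10)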

pyccea.utils.mapping.shifted_heaviside_function(real_solution: ndarray, shift: float = 0.5) ndarray[source]

Convert continuous solutions into binary solutions through the Heaviside step function.

Parameters:
real_solution: np.ndarray (n_features,)

Solution with continuous values.

shift: float, default 0.5

Threshold for coding the solution from continuous-valued space to binary-valued space. Each solution value will be mapped to 1 if it is greater than or equal to shift and 0 otherwise.

Returns:
binary_solution: np.ndarray (n_features,)

Real solution encoded in a binary representation.
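
The mapping reduces to a vectorized threshold; a minimal sketch of the same behaviour:

    import numpy as np

    def shifted_heaviside_sketch(real_solution: np.ndarray, shift: float = 0.5) -> np.ndarray:
        # 1 where the value meets or exceeds the shift, 0 otherwise.
        return (real_solution >= shift).astype(int)

    shifted_heaviside_sketch(np.array([0.2, 0.5, 0.9]))  # -> array([0, 1, 1])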

pyccea.utils.memory module

pyccea.utils.memory.force_memory_release() None[source]

Force memory release back to the OS when supported.
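
One plausible implementation of such a helper, shown only as a sketch (glibc’s malloc_trim is the usual mechanism on Linux; other platforms have no portable equivalent):

    import ctypes
    import gc
    import sys

    def force_memory_release_sketch() -> None:
        gc.collect()  # drop unreachable Python objects first
        if sys.platform.startswith("linux"):
            try:
                # Ask glibc to return free heap pages to the OS.
                ctypes.CDLL("libc.so.6").malloc_trim(0)
            except (OSError, AttributeError):
                pass  # non-glibc libc: nothing to do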

pyccea.utils.metrics module

class pyccea.utils.metrics.ClassificationMetrics(n_classes: int)[source]

Bases: object

Evaluate a machine learning model trained for a classification problem.

Attributes:
values: dict

Values of classification metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute classification metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute classification metrics.

Parameters:
estimator: BaseEstimator

Model that will be evaluated.

X_test: np.ndarray

Input test data.

y_test: np.ndarray

Output test data.

verbose: bool, default False

If True, show evaluation metrics.

metrics = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1_score', 'specificity']
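
A usage sketch with a scikit-learn estimator (the keys of values are assumed to match the names in the metrics list above):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from pyccea.utils.metrics import ClassificationMetrics

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

    metrics = ClassificationMetrics(n_classes=2)
    metrics.compute(model, X_test, y_test, verbose=True)
    print(metrics.values.get("balanced_accuracy"))  # assumed key
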
class pyccea.utils.metrics.RegressionMetrics[source]

Bases: object

Evaluate a machine learning model trained for a regression problem.

Attributes:
values: dict

Values of regression metrics.

Methods

compute(estimator, X_test, y_test[, verbose])

Compute regression metrics.

compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)[source]

Compute regression metrics.

Parameters:
estimator: BaseEstimator

Model that will be evaluated.

X_test: np.ndarray

Input test data.

y_test: np.ndarray

Output test data.

verbose: bool, default False

If True, show evaluation metrics.

metrics = ['mae', 'mse', 'rmse', 'mape']

pyccea.utils.models module

class pyccea.utils.models.ClassificationModel(model_type: str)[source]

Bases: object

Load a classification model, adjust its hyperparameters and get the best model.

Attributes:
estimator: sklearn model object

Trained model. If optimize is True, this is the model chosen by the randomized search, i.e., the estimator that gave the best result on the validation data.

hyperparams: dict

Hyperparameters of the model. If optimize is True, these are the best hyperparameters found and used to fit the machine learning model.

Methods

clone()

Create a new unfitted estimator with the same parameters.

train(X_train, y_train[, seed, kfolds, ...])

Build and train a classification model.

clone() ClassificationModel[source]

Create a new unfitted estimator with the same parameters.

Clone does a deep copy of the model in an estimator without actually copying attached data. It returns a new estimator with the same parameters that has not been fitted on any data.

Returns:
cloned_model: ClassificationModel

Fresh copy of the classification model wrapper with a cloned underlying sklearn estimator.

models = {'complement_naive_bayes': (<class 'sklearn.naive_bayes.ComplementNB'>, {}), 'gaussian_naive_bayes': (<class 'sklearn.naive_bayes.GaussianNB'>, {}), 'k_nearest_neighbors': (<class 'sklearn.neighbors._classification.KNeighborsClassifier'>, {'n_neighbors': 1}), 'logistic_regression': (<class 'sklearn.linear_model._logistic.LogisticRegression'>, {}), 'multinomial_naive_bayes': (<class 'sklearn.naive_bayes.MultinomialNB'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestClassifier'>, {'class_weight': 'balanced', 'random_state': 0}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVC'>, {'random_state': 0})}
train(X_train: ndarray, y_train: ndarray, seed: int | None = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a classification model.

Parameters:
X_train: np.ndarray

Train input data.

y_train: np.ndarray

Train output data.

seed: int, default None

Controls the shuffling applied for subsampling the data.

kfolds: int, default 10

Number of folds in the k-fold cross validation.

n_iter: int, default 100

Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.

optimize: bool, default False

If True, optimize the hyperparameters of the model and return the best model.

verbose: bool, default False

If True, show logs.
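
A short training sketch on synthetic data (the call to the wrapped estimator’s predict method assumes the usual scikit-learn interface):

    import numpy as np
    from pyccea.utils.models import ClassificationModel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(80, 10))
    y_train = rng.integers(0, 2, size=80)

    model = ClassificationModel(model_type="k_nearest_neighbors")
    model.train(X_train, y_train, seed=0, optimize=False)

    preds = model.estimator.predict(X_train)  # wrapped sklearn estimator
    fresh = model.clone()  # same parameters, not yet fitted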

class pyccea.utils.models.RegressionModel(model_type: str)[source]

Bases: object

Load a regression model, adjust its hyperparameters and get the best model.

Attributes:
estimator: sklearn model object

Trained model. If optimize is True, this is the model chosen by the randomized search, i.e., the estimator that gave the best result on the validation data.

hyperparams: dict

Hyperparameters of the model. If optimize is True, these are the best hyperparameters found and used to fit the machine learning model.

Methods

clone()

Create a new unfitted estimator with the same parameters.

train(X_train, y_train[, seed, kfolds, ...])

Build and train a regression model.

clone() RegressionModel[source]

Create a new unfitted estimator with the same parameters.

Clone does a deep copy of the model in an estimator without actually copying attached data. It returns a new estimator with the same parameters that has not been fitted on any data.

Returns:
cloned_model: RegressionModel

Fresh copy of the regression model wrapper with a cloned underlying sklearn estimator.

models = {'elastic_net': (<class 'sklearn.linear_model._coordinate_descent.ElasticNet'>, {'random_state': 0}), 'lasso': (<class 'sklearn.linear_model._coordinate_descent.Lasso'>, {'random_state': 0}), 'linear': (<class 'sklearn.linear_model._base.LinearRegression'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestRegressor'>, {'random_state': 0}), 'ridge': (<class 'sklearn.linear_model._ridge.Ridge'>, {'random_state': 0}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVR'>, {})}
train(X_train: ndarray, y_train: ndarray, seed: int | None = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)[source]

Build and train a regression model.

Parameters:
X_train: np.ndarray

Train input data.

y_train: np.ndarray

Train output data.

seed: int, default None

Controls the shuffling applied for subsampling the data.

kfolds: int, default 10

Number of folds in the k-fold cross validation.

n_iter: int, default 100

Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.

optimize: bool, default False

If True, optimize the hyperparameters of the model and return the best model.

verbose: bool, default False

If True, show logs.

pyccea.utils.preprocessing module

class pyccea.utils.preprocessing.Winsoriser(lower: float = 0.01, upper: float = 0.99)[source]

Bases: BaseEstimator, TransformerMixin

Winsorization (quantile-based truncation) class.

The ‘fit’ method calculates the lower and upper quantiles on the training data. The ‘transform’ method applies the capping, replacing values below the lower bound with the lower bound and values above the upper bound with the upper bound.

Attributes:
_lower_bounds: ndarray, shape (n_features,)

The lower quantile values calculated for each column of X during fit.

_upper_bounds: ndarray, shape (n_features,)

The upper quantile values calculated for each column of X during fit.

Methods

fit(X[, y])

Calculate the truncation limits (quantiles) per column (feature) in X.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Apply Winsorization (truncation) to X using the limits calculated in fit.

fit(X: ndarray, y: ndarray | None = None) None[source]

Calculate the truncation limits (quantiles) per column (feature) in X.

Parameters:
X: ndarray, shape (n_samples, n_features)

The training dataset for which the quantiles will be calculated.

y: ndarray, default=None

Ignored. Present for scikit-learn API compatibility.

Returns:
self: object

Returns the instance of the transformer.

transform(X: ndarray) ndarray[source]

Apply Winsorization (truncation) to X using the limits calculated in fit.

Parameters:
X: ndarray, shape (n_samples, n_features)

The dataset to be transformed (can be training or test data).

Returns:
X_transformed: ndarray, shape (n_samples, n_features)

The transformed array with truncated outliers.
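
A small sketch showing the fit/transform cycle on data with an injected outlier:

    import numpy as np
    from pyccea.utils.preprocessing import Winsoriser

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[0, 0] = 50.0  # inject an extreme outlier

    winsoriser = Winsoriser(lower=0.05, upper=0.95)
    X_capped = winsoriser.fit_transform(X)  # fit_transform from TransformerMixin
    print(X[:, 0].max(), X_capped[:, 0].max())  # outlier capped near the 95th percentile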

pyccea.utils.stats module

pyccea.utils.stats.statistical_comparison_between_independent_samples(x: ndarray, y: ndarray, alpha: float = 0.05, alternative: str = 'two-sided') Tuple[str, float, float, bool][source]

Perform a statistical comparison between two independent samples.

This function compares two independent samples x and y using a statistical test (e.g., a t-test) to determine whether there is a significant difference between them. The test is two-sided by default but can be configured to be one-sided. It returns the name of the test used, the computed p-value, the test statistic, and whether the null hypothesis is rejected at the given significance level alpha.

Parameters:
x: np.ndarray

The first independent sample.

y: np.ndarray

The second independent sample.

alpha: float, optional

The significance level for the hypothesis test (default is 0.05).

alternative: str, optional

Defines the alternative hypothesis. Options are ‘two-sided’, ‘greater’, or ‘less’ (default is ‘two-sided’).

Returns:
test_name: str

The name of the statistical test used (e.g., ‘t-test’, ‘mann-whitney-u’).

p_comparison: float

The computed p-value of the test.

statistic: float

The test statistic from the hypothesis test.

reject_null: bool

True if the null hypothesis is rejected at the given significance level, False otherwise.

Raises:
ValueError

If the input arrays x and y have different lengths or are not one-dimensional.

ValueError

If the alternative argument is not one of ‘two-sided’, ‘greater’, or ‘less’.
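
A usage sketch comparing, say, per-fold accuracies of two methods (the synthetic samples are illustrative; return values follow the order documented above):

    import numpy as np
    from pyccea.utils.stats import statistical_comparison_between_independent_samples

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.80, scale=0.05, size=30)  # accuracies of method A
    y = rng.normal(loc=0.75, scale=0.05, size=30)  # accuracies of method B

    test_name, p_value, statistic, reject_null = (
        statistical_comparison_between_independent_samples(x, y, alpha=0.05)
    )
    print(f"{test_name}: statistic={statistic:.3f}, p={p_value:.4f}, reject H0={reject_null}")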