pyccea.utils package
Submodules
pyccea.utils.config module
pyccea.utils.datasets module
- class pyccea.utils.datasets.DataLoader(dataset: str, conf: dict)
Bases: object
Load dataset and preprocess it to train machine learning algorithms.
- Attributes:
- data: pd.DataFrame
Raw dataset.
- X: pd.DataFrame
Raw input data.
- y: pd.Series
Raw output data.
- X_train: np.ndarray
Train input data.
- X_test: np.ndarray
Test input data.
- y_train: np.ndarray
Train output data.
- y_test: np.ndarray
Test output data.
- n_examples: int
Total number of examples.
- n_features: int
Number of features in the dataset.
- n_classes: int
Number of classes.
- classes: np.ndarray
Class identifiers.
- train_size: int
Number of examples in the training set.
- test_size: int
Number of examples in the test set.
- seed: int, default None
It controls the randomness of the data split.
- preset: bool, default False
In some works, the training and testing sets have already been defined. To use them, just set this boolean variable to True.
- test_ratio: float
Proportion of the dataset to include in the test set. It should be between 0 and 1.
- splitter_type: str
Model selection strategy. It can be “k_fold” or “leave_one_out”.
- kfolds: int or None
Number of folds in the k-fold cross validation or in the leave-one-out cross validation.
- stratified: bool, default False
If True, the folds are made by preserving the percentage of examples for each class. It is only used when the ‘splitter_type’ parameter is set to ‘k_fold’.
- normalize: bool, default False
If True, normalizes training and test sets generated by the split method.
Methods
get_ready(): Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.
- DATASETS = {'11_tumor': {'file': '11_tumor.parquet', 'task': 'classification'}, '9_tumor': {'file': '9_tumor.parquet', 'task': 'classification'}, 'brain_tumor_1': {'file': 'brain_tumor_1.parquet', 'task': 'classification'}, 'brain_tumor_2': {'file': 'brain_tumor_2.parquet', 'task': 'classification'}, 'cbd': {'file': 'cbd.parquet', 'task': 'classification'}, 'dermatology': {'file': 'dermatology.parquet', 'task': 'classification'}, 'divorce': {'file': 'divorce.parquet', 'task': 'classification'}, 'dlbcl': {'file': 'dlbcl.parquet', 'task': 'classification'}, 'dorothea': {'file': 'dorothea.parquet', 'task': 'classification'}, 'gfe': {'file': 'gfe.parquet', 'task': 'classification'}, 'hapt': {'file': 'hapt.parquet', 'task': 'classification'}, 'har': {'file': 'har.parquet', 'task': 'classification'}, 'isolet5': {'file': 'isolet5.parquet', 'task': 'classification'}, 'itt_f': {'file': 'itt_f.parquet', 'task': 'regression'}, 'itt_m': {'file': 'itt_m.parquet', 'task': 'regression'}, 'leukemia_1': {'file': 'leukemia_1.parquet', 'task': 'classification'}, 'leukemia_2': {'file': 'leukemia_2.parquet', 'task': 'classification'}, 'leukemia_3': {'file': 'leukemia_3.parquet', 'task': 'classification'}, 'libras': {'file': 'libras.parquet', 'task': 'classification'}, 'lsvt': {'file': 'lsvt.parquet', 'task': 'classification'}, 'lungc': {'file': 'lungc.parquet', 'task': 'classification'}, 'madelon_valid': {'file': 'madelon_valid.parquet', 'task': 'classification'}, 'mfd': {'file': 'mfd.parquet', 'task': 'classification'}, 'orh': {'file': 'orh.parquet', 'task': 'classification'}, 'prostate_tumor_1': {'file': 'prostate_tumor_1.parquet', 'task': 'classification'}, 'qsar_toxicity': {'file': 'qsar_oral_toxicity.parquet', 'task': 'classification'}, 'scadi': {'file': 'scadi.parquet', 'task': 'classification'}, 'shd': {'file': 'shd.parquet', 'task': 'classification'}, 'uji_indoor': {'file': 'uji_indoor_loc.parquet', 'task': 'classification'}, 'wdbc': {'file': 'wdbc.parquet', 'task': 
'classification'}}
- NORMALIZATION_METHODS = {'min_max': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'standard': <class 'sklearn.preprocessing._data.StandardScaler'>}
- PRIMARY_CONF_KEYS = ['general', 'splitter', 'normalization']
- SPLITTER_TYPES = ['k_fold', 'leave_one_out']
- get_ready() → None
Prepare the data for a Cooperative Co-Evolutionary Algorithm to perform feature selection.
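The two values listed in SPLITTER_TYPES can be illustrated with a small index-splitting sketch. This is plain Python written for illustration and is not PyCCEA's implementation: the function names k_fold_indices and leave_one_out_indices are hypothetical, the folds are contiguous and unshuffled, and stratification and normalization are left out.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_examples: int, kfolds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of `kfolds` contiguous folds."""
    if not 2 <= kfolds <= n_examples:
        raise ValueError("kfolds must be between 2 and n_examples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_examples, kfolds)
    start = 0
    for fold in range(kfolds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size


def leave_one_out_indices(n_examples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Leave-one-out behaves like k-fold with exactly one example per fold."""
    return k_fold_indices(n_examples, n_examples)


for train_idx, test_idx in k_fold_indices(n_examples=10, kfolds=3):
    print(len(train_idx), "train /", len(test_idx), "test")
```

Each example lands in exactly one test fold, which is the property both strategies share; they differ only in how many examples each fold holds.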
- toml_file = <_io.TextIOWrapper name='/home/runner/work/PyCCEA/PyCCEA/pyccea/parameters/datasets.toml' mode='r' encoding='utf-8'>
pyccea.utils.mapping module
- pyccea.utils.mapping.angle_modulation_function(coeffs: ndarray, n_features: int) → ndarray
Homomorphous mapping between binary-valued and continuous-valued space used by the Angle Modulated Differential Evolution (AMDE) algorithm.
- Parameters:
- coeffs: np.ndarray (4,)
The AMDE evolves values for the four coefficients a, b, c, and d. The first coefficient represents the horizontal shift of the function, the second coefficient represents the maximum frequency of the sine function, the third coefficient represents the frequency of the cosine function, and the fourth coefficient represents the vertical shift of the function.
- n_features: int
Number of variables in the original space (e.g., features in a subcomponent).
- Returns:
- binary_solution: np.ndarray (n_features,)
Binary solution in the original space.
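The role of the four coefficients is easier to see in code. The sketch below uses the generating function commonly given for angle modulation, g(x) = sin(2π(x − a) · b · cos(2π(x − a) · c)) + d, and emits a 1 wherever g(x) is positive. It is a dependency-free illustration, not the library's code: the name angle_modulation_sketch is hypothetical, and sampling the function at x = 0, 1, ..., n_features − 1 is an assumption, since the sampling points used by PyCCEA are not stated here.

```python
import math
from typing import List, Sequence


def angle_modulation_sketch(coeffs: Sequence[float], n_features: int) -> List[int]:
    """Map four real coefficients (a, b, c, d) to a bit string of length n_features."""
    a, b, c, d = coeffs
    bits = []
    for x in range(n_features):  # assumed sampling points: 0, 1, ..., n_features - 1
        shifted = 2 * math.pi * (x - a)          # a: horizontal shift
        g = math.sin(shifted * b * math.cos(shifted * c)) + d  # d: vertical shift
        bits.append(1 if g > 0 else 0)
    return bits


# Four evolved coefficients yield a binary mask over 8 features.
print(angle_modulation_sketch([0.3, 1.2, 0.7, -0.1], n_features=8))
```

The appeal of this mapping is that the optimizer searches a four-dimensional continuous space regardless of how many features the subcomponent holds.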
- pyccea.utils.mapping.shifted_heaviside_function(real_solution: ndarray, shift: float = 0.5) → ndarray
Convert continuous solutions into binary solutions through the heaviside step function.
- Parameters:
- real_solution: np.ndarray (n_features,)
Solution with continuous values.
- shift: float, default 0.5
Threshold for coding the solution from continuous-valued space to binary-valued space. Each solution value will be mapped to 1 if it is greater than or equal to shift and 0 otherwise.
- Returns:
- binary_solution: np.ndarray (n_features,)
Real solution encoded in a binary representation.
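The thresholding rule described for shift can be written in one line. This is a minimal plain-Python sketch of that rule (the name shifted_heaviside_sketch is hypothetical, and it works on lists rather than NumPy arrays):

```python
from typing import List, Sequence


def shifted_heaviside_sketch(real_solution: Sequence[float], shift: float = 0.5) -> List[int]:
    """Return 1 for every value greater than or equal to `shift`, 0 otherwise."""
    return [1 if value >= shift else 0 for value in real_solution]


print(shifted_heaviside_sketch([0.1, 0.5, 0.9]))  # -> [0, 1, 1]
```

Note that a value exactly equal to shift maps to 1, as stated in the parameter description.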
pyccea.utils.metrics module
- class pyccea.utils.metrics.ClassificationMetrics(n_classes: int)
Bases: object
Evaluate machine learning model trained for a classification problem.
- Attributes:
- values: dict()
Values of classification metrics.
Methods
compute(estimator, X_test, y_test[, verbose]): Compute classification metrics.
- compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)
Compute classification metrics.
- Parameters:
- estimator: BaseEstimator
Model that will be evaluated.
- X_test: np.ndarray
Input test data.
- y_test: np.ndarray
Output test data.
- verbose: bool, default False
If True, show evaluation metrics.
- metrics = ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1_score', 'specificity']
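For reference, the six metric names above have simple definitions in the binary case. The sketch below computes them from confusion-matrix counts in plain Python. It is an illustration only: the function name is hypothetical, labels are assumed to be 0 and 1, and how PyCCEA aggregates these metrics across more than two classes is not covered.

```python
from typing import Dict, Sequence


def binary_classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute the listed metrics for labels in {0, 1}, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # true positive rate
    specificity = ratio(tn, tn + fp)     # true negative rate
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1_score": ratio(2 * precision * recall, precision + recall),
        "specificity": specificity,
    }


print(binary_classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1]))
```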
- class pyccea.utils.metrics.RegressionMetrics
Bases: object
Evaluate machine learning model trained for a regression problem.
- Attributes:
- values: dict()
Values of regression metrics.
Methods
compute(estimator, X_test, y_test[, verbose]): Compute regression metrics.
- compute(estimator: BaseEstimator, X_test: ndarray, y_test: ndarray, verbose: bool = False)
Compute regression metrics.
- Parameters:
- estimator: BaseEstimator
Model that will be evaluated.
- X_test: np.ndarray
Input test data.
- y_test: np.ndarray
Output test data.
- verbose: bool, default False
If True, show evaluation metrics.
- metrics = ['mae', 'mse', 'rmse', 'mape']
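The four abbreviations stand for mean absolute error, mean squared error, root mean squared error, and mean absolute percentage error. A plain-Python sketch of their usual definitions follows. The function name is hypothetical, MAPE is returned as a fraction rather than a percentage (which convention PyCCEA reports is an assumption left open here), and targets are assumed to be non-zero.

```python
import math
from typing import Dict, Sequence


def regression_metrics_sketch(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE and MAPE for two equally long sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        # Relative error per example; undefined when a true value is zero.
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


print(regression_metrics_sketch([2.0, 4.0, 5.0], [1.0, 4.0, 7.0]))
```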
pyccea.utils.models module
- class pyccea.utils.models.ClassificationModel(model_type: str)
Bases: object
Load a classification model, adjust its hyperparameters and get the best model.
- Attributes:
- estimator: sklearn model object
Trained model. If optimize is True, this is the model selected by the randomized search, i.e., the estimator that gave the best result on the validation data.
- hyperparams: dict
Hyperparameters of the model. If optimize is True, these are the best hyperparameters found, which are used to fit the machine learning model.
Methods
train(X_train, y_train[, seed, kfolds, ...]): Build and train a classification model.
- models = {'complement_naive_bayes': (<class 'sklearn.naive_bayes.ComplementNB'>, {}), 'gaussian_naive_bayes': (<class 'sklearn.naive_bayes.GaussianNB'>, {}), 'k_nearest_neighbors': (<class 'sklearn.neighbors._classification.KNeighborsClassifier'>, {'n_neighbors': 1}), 'logistic_regression': (<class 'sklearn.linear_model._logistic.LogisticRegression'>, {}), 'multinomial_naive_bayes': (<class 'sklearn.naive_bayes.MultinomialNB'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestClassifier'>, {'n_jobs': -1}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVC'>, {})}
- train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)
Build and train a classification model.
- Parameters:
- X_train: np.ndarray
Train input data.
- y_train: np.ndarray
Train output data.
- seed: int, default None
Controls the shuffling applied for subsampling the data.
- kfolds: int, default 10
Number of folds in the k-fold cross validation.
- n_iter: int, default 100
Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.
- optimize: bool, default False
If True, optimize the hyperparameters of the model and return the best model.
- verbose: bool, default False
If True, show logs.
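How n_iter and seed interact during a randomized search can be shown with a generic sketch. This is not the library's training code: randomized_search_sketch, param_grid, and score are hypothetical names, and the score callable stands in for whatever validation score (for example, a mean over kfolds folds) the real search would compute.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple


def randomized_search_sketch(
    param_grid: Dict[str, List[Any]],
    score: Callable[[Dict[str, Any]], float],
    n_iter: int = 100,
    seed: Optional[int] = None,
) -> Tuple[Dict[str, Any], float]:
    """Sample `n_iter` hyperparameter settings and keep the best-scoring one."""
    rng = random.Random(seed)  # a fixed seed makes the sampled settings reproducible
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy score that peaks at n_neighbors == 5; a real score would come from cross validation.
grid = {"n_neighbors": [1, 3, 5, 7]}
print(randomized_search_sketch(grid, lambda p: -abs(p["n_neighbors"] - 5), n_iter=20, seed=0))
```

A larger n_iter explores more settings at the cost of more model fits, which is the runtime versus quality trade-off mentioned in the parameter description.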
- class pyccea.utils.models.RegressionModel(model_type: str)
Bases: object
Load a regression model, adjust its hyperparameters and get the best model.
- Attributes:
- estimator: sklearn model object
Trained model. If optimize is True, this is the model selected by the randomized search, i.e., the estimator that gave the best result on the validation data.
- hyperparams: dict
Hyperparameters of the model. If optimize is True, these are the best hyperparameters found, which are used to fit the machine learning model.
Methods
train(X_train, y_train[, seed, kfolds, ...]): Build and train a regression model.
- models = {'elastic_net': (<class 'sklearn.linear_model._coordinate_descent.ElasticNet'>, {}), 'lasso': (<class 'sklearn.linear_model._coordinate_descent.Lasso'>, {}), 'linear': (<class 'sklearn.linear_model._base.LinearRegression'>, {}), 'random_forest': (<class 'sklearn.ensemble._forest.RandomForestRegressor'>, {}), 'ridge': (<class 'sklearn.linear_model._ridge.Ridge'>, {}), 'support_vector_machine': (<class 'sklearn.svm._classes.SVR'>, {})}
- train(X_train: ndarray, y_train: ndarray, seed: int = None, kfolds: int = 10, n_iter: int = 100, optimize: bool = False, verbose: bool = False)
Build and train a regression model.
- Parameters:
- X_train: np.ndarray
Train input data.
- y_train: np.ndarray
Train output data.
- seed: int, default None
Controls the shuffling applied for subsampling the data.
- kfolds: int, default 10
Number of folds in the k-fold cross validation.
- n_iter: int, default 100
Number of hyperparameter settings that are sampled. It trades off runtime and quality of the solution.
- optimize: bool, default False
If True, optimize the hyperparameters of the model and return the best model.
- verbose: bool, default False
If True, show logs.
pyccea.utils.stats module
- pyccea.utils.stats.statistical_comparison_between_independent_samples(x: ndarray, y: ndarray, alpha: float = 0.05, alternative: str = 'two-sided') → Tuple[str, float, float, bool]
Perform a statistical comparison between two independent samples.
This function compares two independent samples x and y using a statistical test (e.g., t-test) to determine whether there is a significant difference between them. The test is two-sided by default but can be configured as one-sided. It returns the name of the test, the computed p-value, the test statistic, and whether the null hypothesis is rejected at the significance level alpha.
- Parameters:
- x: np.ndarray
The first independent sample.
- y: np.ndarray
The second independent sample.
- alpha: float, optional
The significance level for the hypothesis test (default is 0.05).
- alternative: str, optional
Defines the alternative hypothesis. Options are ‘two-sided’, ‘greater’, or ‘less’ (default is ‘two-sided’).
- Returns:
- test_name: str
The name of the statistical test used (e.g., ‘t-test’, ‘mann-whitney-u’).
- p_comparison: float
The computed p-value of the test.
- statistic: float
The test statistic from the hypothesis test.
- reject_null: bool
True if the null hypothesis is rejected at the given significance level, False otherwise.
- Raises:
- ValueError
If the input arrays x and y have different lengths or are not one-dimensional.
- ValueError
If the alternative argument is not one of ‘two-sided’, ‘greater’, or ‘less’.
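The shape of the returned tuple and the meaning of alternative can be illustrated with a self-contained sketch. It implements only a Mann-Whitney U test with a normal approximation, without tie or continuity corrections, using the standard library. It is not the PyCCEA function: the name mann_whitney_sketch is hypothetical, how the library chooses between tests is not described here, and both samples are assumed to be non-empty.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_whitney_sketch(
    x: Sequence[float],
    y: Sequence[float],
    alpha: float = 0.05,
    alternative: str = "two-sided",
) -> Tuple[str, float, float, bool]:
    """Return (test_name, p_value, statistic, reject_null) for two samples."""
    if alternative not in ("two-sided", "greater", "less"):
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    n_x, n_y = len(x), len(y)
    # U counts the pairs where x beats y; ties count as half a win.
    u_stat = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    mean_u = n_x * n_y / 2
    std_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u_stat - mean_u) / std_u
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))
    elif alternative == "greater":   # x tends to be larger than y
        p_value = 1 - cdf(z)
    else:                            # x tends to be smaller than y
        p_value = cdf(z)
    # The null hypothesis is rejected when the p-value falls below alpha.
    return "mann-whitney-u", p_value, u_stat, p_value < alpha


name, p_value, statistic, reject_null = mann_whitney_sketch(range(10, 20), range(0, 10))
print(name, p_value, statistic, reject_null)
```

The last element of the tuple is simply the comparison of the p-value against alpha, so lowering alpha makes rejection harder without changing the statistic.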