qbiocode.evaluation.dataset_evaluation module#

Summary#

Functions:

evaluate

This function evaluates a dataset and returns a transposed summary DataFrame with various statistical measures, derived from the dataset.

get_coefficient_var

Get the coefficient of variation

get_complexity

Measure the manifold complexity by fitting Isomap and analyzing the geodesic vs. Euclidean distances.

get_condition_number

Get condition number of a matrix.

get_dimensions

Get the number of features, samples, and feature-to-sample ratio from a DataFrame.

get_entropy

Calculate entropy of the target variable

get_fdr

Calculate Fisher Discriminant Ratio for a given dataset.

get_fractal_dim

Calculate the fractal dimension of the data using Higuchi's method

get_intrinsic_dim

Get intrinsic dimension of the data using lPCA from skdim.

get_log_density

Calculate the mean log density of the data

get_low_var_features

Count the low variance features

get_moments

Compute third and fourth order moments of the data

get_mutual_information

Calculate mutual information via sklearn

get_nnz

Calculate nonzero values in the data

get_total_correlation

Calculate Total Correlation

get_variance

Get variance

get_volume

Get volume of the data from Convex Hull

Reference#

get_dimensions(df)[source]#

Get the number of features, samples, and feature-to-sample ratio from a DataFrame.

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

(num_features, num_samples, ratio)
  • num_features (int): Number of features in the DataFrame

  • num_samples (int): Number of samples in the DataFrame

  • ratio (float): Feature-to-sample ratio

Return type:

tuple
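As a rough illustration, the three values can be read off the DataFrame's shape. The helper name `dimensions_sketch` is hypothetical and is not the module's implementation; only the order of the returned tuple follows the description above.

```python
import pandas as pd


def dimensions_sketch(df: pd.DataFrame) -> tuple[int, int, float]:
    """Hypothetical stand-in: observations in rows, features in columns."""
    num_samples, num_features = df.shape
    ratio = num_features / num_samples  # feature-to-sample ratio
    return num_features, num_samples, ratio


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
result = dimensions_sketch(toy)
```

A ratio well above 1 means there are more features than observations, which is usually the harder regime for model fitting.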

get_intrinsic_dim(df)[source]#

Get intrinsic dimension of the data using lPCA from skdim.

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

Intrinsic dimension of the data

Return type:

float

get_condition_number(df)[source]#

Get condition number of a matrix.

A matrix with a high condition number is said to be ill-conditioned. Ill-conditioned matrices produce large errors in their output even with small errors in their input. A low condition number means the output is more stable against input errors.

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

condition number of the matrix represented in df

Return type:

float
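The effect of near-collinear features on the condition number can be sketched with NumPy. This assumes the 2-norm condition number from `numpy.linalg.cond`; which norm the module itself uses is not stated here, and the toy DataFrames are illustrative.

```python
import numpy as np
import pandas as pd

# Orthogonal columns: perfectly conditioned.
well = pd.DataFrame({"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0]})
# Nearly identical columns: ill-conditioned.
ill = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1.0, 2.0, 3.000001]})

# With the default p=None, cond() returns the ratio of the largest
# to the smallest singular value (2-norm condition number).
cond_well = np.linalg.cond(well.to_numpy())
cond_ill = np.linalg.cond(ill.to_numpy())
```

Here `cond_well` is 1, while `cond_ill` is several orders of magnitude larger because the two columns carry almost the same information.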

get_fdr(df, y)[source]#

Calculate Fisher Discriminant Ratio for a given dataset.

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • y (int) – supervised binary class label

Returns:

Fisher Discriminant ratio

Return type:

float
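One common per-feature form of the Fisher Discriminant Ratio is the squared difference of the class means divided by the sum of the class variances. The sketch below assumes that form and binary labels; the helper name `fisher_ratio_sketch` is hypothetical, and how the module reduces the per-feature values to a single float is not specified here.

```python
import numpy as np
import pandas as pd


def fisher_ratio_sketch(df: pd.DataFrame, y) -> pd.Series:
    """Per-feature (mean difference)^2 / (sum of class variances), binary labels."""
    y = np.asarray(y)
    classes = np.unique(y)
    group0, group1 = df[y == classes[0]], df[y == classes[1]]
    return (group0.mean() - group1.mean()) ** 2 / (group0.var() + group1.var())


toy = pd.DataFrame({
    "separating": [0.0, 0.1, -0.1, 5.0, 5.1, 4.9],
    "noise": [1.0, -1.0, 0.5, 0.8, -0.9, 0.4],
})
labels = [0, 0, 0, 1, 1, 1]
ratios = fisher_ratio_sketch(toy, labels)
```

A feature whose class means are far apart relative to the within-class spread gets a large ratio; a feature that does not separate the classes stays near zero.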

get_total_correlation(df)[source]#

Calculate Total Correlation

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

Total correlation

Return type:

float
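Total correlation is the sum of the marginal entropies minus the joint entropy, and its value depends on how those entropies are estimated. As a minimal sketch under a Gaussian assumption (which may differ from the estimator the module uses), it reduces to minus half the log-determinant of the correlation matrix. The helper name `gaussian_total_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def gaussian_total_correlation(df: pd.DataFrame) -> float:
    """Total correlation in nats, assuming jointly Gaussian features."""
    corr = np.corrcoef(df.to_numpy(), rowvar=False)
    _, logdet = np.linalg.slogdet(corr)  # log|R|, numerically stable
    return float(-0.5 * logdet)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
independent = pd.DataFrame({"a": a, "b": rng.normal(size=500)})
dependent = pd.DataFrame({"a": a, "b": a + 0.1 * rng.normal(size=500)})

tc_independent = gaussian_total_correlation(independent)
tc_dependent = gaussian_total_correlation(dependent)
```

Independent features give a value close to zero, and the value grows as the features become more redundant.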

get_mutual_information(df, y)[source]#

Calculate mutual information via sklearn

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • y (int) – supervised binary class label

Returns:

Mutual information

Return type:

float
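Since the description refers to sklearn, a plausible sketch uses `sklearn.feature_selection.mutual_info_classif`, which returns one estimate per feature. Averaging those estimates into a single float is an assumption made here for illustration, not a documented detail of the module.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "informative": labels + 0.1 * rng.normal(size=300),  # tracks the label
    "noise": rng.normal(size=300),                       # unrelated to the label
})

# One non-negative estimate (in nats) per feature column.
mi = mutual_info_classif(toy, labels, random_state=0)
mean_mi = float(mi.mean())  # assumed reduction to a single number
```

The informative column scores much higher than the noise column, whose estimate stays near zero.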

get_variance(df)[source]#

Get variance

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

(avg_var, std_var)
  • avg_var (float): Mean variance

  • std_var (float): Standard deviation of variance

Return type:

tuple

get_coefficient_var(df)[source]#

Get the coefficient of variation

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

(avg_co_of_v, std_var)
  • avg_co_of_v (float): Mean coefficient of variation

  • std_var (float): Standard deviation of coefficient of variation

Return type:

tuple
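The coefficient of variation of a feature is its standard deviation divided by its mean, which makes it independent of the feature's scale. The sketch below assumes the sample standard deviation (pandas default, `ddof=1`); the module's exact convention is not stated here.

```python
import pandas as pd

# Column "b" is column "a" scaled by 10, so both have the same CV.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

cv = toy.std() / toy.mean()  # per-feature coefficient of variation
avg_cv = cv.mean()           # mean across features
std_cv = cv.std()            # spread across features
```

Both columns yield 0.5, so the mean is 0.5 and the spread across features is 0.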

get_nnz(df)[source]#

Calculate nonzero values in the data

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

nonzero count

Return type:

int

get_low_var_features(df, num_features)[source]#

Count the low variance features

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • num_features (int) – number of features in the dataset

Raises:

ValueError – If no feature is strong enough to keep

Returns:

count of features with low variance

Return type:

int
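A count like this can be sketched with scikit-learn's `VarianceThreshold`, which also raises `ValueError` when no feature passes the threshold. Whether the module uses this class, and the cutoff of 0.01 used below, are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [0.0, 0.0, 0.0, 0.01],
    "varying": [0.0, 1.0, 2.0, 3.0],
})
num_features = toy.shape[1]

threshold = 0.01  # assumed cutoff, not taken from the module
selector = VarianceThreshold(threshold=threshold).fit(toy)

# get_support() marks the features that are kept; the rest are low variance.
low_var_count = num_features - int(selector.get_support().sum())

# If every feature falls below the threshold, fit() raises ValueError.
try:
    VarianceThreshold(threshold=10.0).fit(toy)
    raised = False
except ValueError:
    raised = True
```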

get_log_density(df)[source]#

Calculate the mean log density of the data

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

mean log kernel density

Return type:

float
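A mean log kernel density can be sketched with scikit-learn's `KernelDensity`. The Gaussian kernel and the bandwidth of 0.5 are illustrative assumptions; the module's settings are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(toy)
log_density = kde.score_samples(toy)  # log-density at each observation
mean_log_density = float(log_density.mean())
```

Tightly clustered data produces a higher mean log density than data spread thinly over the same space.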

get_fractal_dim(df, k_max)[source]#

Calculate the fractal dimension of the data using Higuchi’s method

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • k_max (int) – Maximum number of k values to use in the calculation

Returns:

Fractal dimension of the data

Return type:

float
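Higuchi's method is defined for a one-dimensional series: it measures the curve length at several step sizes k and takes the slope of log(length) against log(1/k). The sketch below applies it to a single series; how the module maps a multi-column DataFrame onto this procedure is not specified here, and the helper name `higuchi_fd` is hypothetical.

```python
import numpy as np


def higuchi_fd(x, k_max: int) -> float:
    """Higuchi fractal dimension of a 1-D series (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ks = np.arange(1, k_max + 1)
    lengths = []
    for k in ks:
        per_offset = []
        for m in range(k):  # k sub-series, one per starting offset
            idx = np.arange(m, n, k)
            steps = idx.size - 1
            if steps < 1:
                continue
            dist = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / (steps * k)  # Higuchi's normalization factor
            per_offset.append(dist * norm / k)
        lengths.append(np.mean(per_offset))
    # Slope of log L(k) versus log(1/k) estimates the fractal dimension.
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(lengths), 1)
    return float(slope)


fd_line = higuchi_fd(np.arange(200.0), k_max=10)
fd_noise = higuchi_fd(np.random.default_rng(0).normal(size=1000), k_max=10)
```

A straight line has dimension 1, while white noise approaches 2, so the estimate reflects how rough the series is.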

get_moments(df)[source]#

Compute third and fourth order moments of the data

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

(avg_skew, std_skew, avg_kurt, std_kurt)
  • avg_skew (float): Mean skewness

  • std_skew (float): Standard deviation of skewness

  • avg_kurt (float): Mean kurtosis

  • std_kurt (float): Standard deviation of kurtosis

Return type:

tuple
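Per-feature skewness and kurtosis can be sketched with `scipy.stats` and then summarized across features. The use of SciPy and of excess (Fisher) kurtosis are assumptions here; the module may use a different convention.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

toy = pd.DataFrame({
    "symmetric": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "right_tailed": [0.0, 0.0, 0.0, 1.0, 10.0],
})
values = toy.to_numpy()

skews = skew(values, axis=0)      # third standardized moment per feature
kurts = kurtosis(values, axis=0)  # excess kurtosis per feature

avg_skew, std_skew = skews.mean(), skews.std()
avg_kurt, std_kurt = kurts.mean(), kurts.std()
```

The symmetric column has zero skewness, while the column with one large value is strongly right-skewed.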

get_entropy(y)[source]#

Calculate entropy of the target variable

Parameters:

y (int) – supervised binary class label

Returns:

(avg_y_entropy, std_y_entropy)
  • avg_y_entropy (float): Mean entropy

  • std_y_entropy (float): Standard deviation of entropy

Return type:

tuple
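For reference, the Shannon entropy of a label vector can be sketched from its class counts with `scipy.stats.entropy`. This shows only the basic quantity; how the module arrives at a mean and a standard deviation of entropy is not described here, so the sketch does not attempt to reproduce that.

```python
import numpy as np
from scipy.stats import entropy


def label_entropy_sketch(y) -> float:
    """Shannon entropy (bits) of the class distribution in y (hypothetical helper)."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return float(entropy(counts, base=2))  # counts are normalized internally


balanced = label_entropy_sketch([0, 0, 1, 1])
imbalanced = label_entropy_sketch([0, 0, 0, 1])
```

A balanced binary target reaches the maximum of 1 bit, and the entropy drops as the classes become imbalanced.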

get_volume(df)[source]#

Get volume of the data from Convex Hull

Parameters:

df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

Returns:

volume: Volume of the space spanned by the features of the data

Return type:

float
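A convex hull volume can be sketched with `scipy.spatial.ConvexHull`; that the module uses this class is an assumption. Note that for two features the hull's `volume` attribute is the enclosed area.

```python
import pandas as pd
from scipy.spatial import ConvexHull

# Corners of the unit square plus one interior point.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 1.0, 0.0, 0.5],
    "y": [0.0, 0.0, 1.0, 1.0, 0.5],
})

hull = ConvexHull(toy.to_numpy())
volume = hull.volume  # area in 2-D, volume in 3-D, hypervolume beyond that
```

The interior point does not change the result, since only the outermost observations define the hull. The hull computation needs more observations than features and becomes expensive as the number of features grows.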

get_complexity(df, n_neighbors=10, n_components=2)[source]#

Measure the manifold complexity by fitting Isomap and analyzing the geodesic vs. Euclidean distances. This function computes the reconstruction error of the Isomap algorithm, which serves as an indicator of the complexity of the manifold represented by the data.

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • n_neighbors (int) – Number of neighbors for the Isomap algorithm. Defaults to 10

  • n_components (int) – Number of components (dimensions) for Isomap projection. Defaults to 2

Returns:

reconstruction_error: The reconstruction error of the Isomap model (the residual error of geodesic distances), which indicates the complexity of the manifold

Return type:

float
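The procedure described above can be sketched with scikit-learn's `Isomap`, whose `reconstruction_error()` method compares the geodesic distance kernel with its low-dimensional embedding. The Swiss-roll-like toy data is illustrative, and using scikit-learn directly is an assumption about how the module works.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import Isomap

# A 2-D sheet rolled up in 3-D: Euclidean and geodesic distances disagree.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 3.0 * np.pi, size=200)
toy = pd.DataFrame({
    "x": t * np.cos(t),
    "y": rng.uniform(0.0, 5.0, size=200),
    "z": t * np.sin(t),
})

iso = Isomap(n_neighbors=10, n_components=2).fit(toy)
reconstruction_error = float(iso.reconstruction_error())
```

A lower error suggests that the chosen number of components captures the manifold well; a higher error points to a more complex structure.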

evaluate(df, y, file)[source]#

This function evaluates a dataset and returns a transposed summary DataFrame with various statistical measures derived from the dataset.

Using the functions defined above, it computes intrinsic dimension, condition number, Fisher Discriminant Ratio, total correlation, mutual information, variance, coefficient of variation, data sparsity, low variance features, data density, fractal dimension, data distributions (skewness and kurtosis), entropy of the target variable, and manifold complexity.

The summary DataFrame is transposed for easier readability and contains the dataset name, number of features, number of samples, feature-to-sample ratio, and various statistical measures.

This function is useful for quickly summarizing the characteristics of a dataset, especially in the context of machine learning and data analysis, allowing you to correlate the dataset’s properties with its performance in predictive modeling tasks.

Parameters:
  • df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns

  • y (int) – supervised binary class label

  • file (str) – Name of the dataset file for identification in the summary DataFrame

Returns:

transposed: Summary DataFrame containing various statistical measures of the dataset

Return type:

pandas.DataFrame