qbiocode.evaluation.dataset_evaluation module
Summary
Functions:
evaluate(df, y, file) | Evaluate a dataset and return a transposed summary DataFrame of statistical measures derived from it.
get_coefficient_var(df) | Get the coefficient of variation.
get_complexity(df, n_neighbors=10, n_components=2) | Measure the manifold complexity by fitting Isomap and analyzing the geodesic vs. Euclidean distances.
get_condition_number(df) | Get the condition number of a matrix.
get_dimensions(df) | Get the number of features, samples, and feature-to-sample ratio from a DataFrame.
get_entropy(y) | Calculate the entropy of the target variable.
get_fdr(df, y) | Calculate the Fisher Discriminant Ratio for a given dataset.
get_fractal_dim(df, k_max) | Calculate the fractal dimension of the data using Higuchi's method.
get_intrinsic_dim(df) | Get the intrinsic dimension of the data using lPCA from skdim.
get_log_density(df) | Calculate the mean log density of the data.
get_low_var_features(df, num_features) | Get the count of low-variance features.
get_moments(df) | Compute third- and fourth-order moments of the data.
get_mutual_information(df, y) | Calculate mutual information via sklearn.
get_nnz(df) | Count the nonzero values in the data.
get_total_correlation(df) | Calculate the total correlation.
get_variance(df) | Get the variance.
get_volume(df) | Get the volume of the data from its convex hull.
Reference
- get_dimensions(df)
Get the number of features, samples, and feature-to-sample ratio from a DataFrame.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
- (num_features, num_samples, ratio)
num_features (int): Number of features in the DataFrame
num_samples (int): Number of samples in the DataFrame
ratio (float): Feature-to-sample ratio
- Return type:
tuple
- get_intrinsic_dim(df)
Get intrinsic dimension of the data using lPCA from skdim.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
Intrinsic dimension of the data
- Return type:
float
- get_condition_number(df)
Get condition number of a matrix.
A matrix with a high condition number is said to be ill-conditioned: small errors in its input can produce large errors in its output. A low condition number indicates a more stable, well-conditioned matrix.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
condition number of the matrix represented in df
- Return type:
float
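As a rough illustration of what a condition number tells you, the sketch below calls `numpy.linalg.cond` directly on two small matrices. It is not the qbiocode implementation, and the example matrices are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Identity matrix: perfectly conditioned.
well = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])
# Two nearly identical columns: almost singular, so badly conditioned.
ill = pd.DataFrame([[1.0, 1.0], [1.0, 1.0001]])

# np.linalg.cond defaults to the 2-norm (ratio of largest to smallest singular value).
cond_well = float(np.linalg.cond(well.to_numpy()))
cond_ill = float(np.linalg.cond(ill.to_numpy()))

print(cond_well, cond_ill)
```

Nearly collinear features, as in the second matrix, are a typical cause of large condition numbers in tabular data.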
- get_fdr(df, y)
Calculate Fisher Discriminant Ratio for a given dataset.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
y (int) – supervised binary class label
- Returns:
Fisher Discriminant ratio
- Return type:
float
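The documentation does not state which variant of the Fisher Discriminant Ratio is used. The sketch below assumes one common definition for two classes: the maximum over features of the squared difference of class means divided by the sum of class variances. The helper name `fisher_ratio_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd

def fisher_ratio_sketch(df: pd.DataFrame, y) -> float:
    """Max over features of (mu_a - mu_b)^2 / (var_a + var_b); illustrative only."""
    X = df.to_numpy(dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch expects exactly two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    numerator = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    denominator = a.var(axis=0) + b.var(axis=0)  # population variances (ddof=0)
    return float((numerator / denominator).max())

df = pd.DataFrame({
    "separating": [0.0, 1.0, 10.0, 11.0],  # class means far apart
    "noise": [0.0, 1.0, 0.0, 1.0],         # identical class means
})
y = [0, 0, 1, 1]
fdr = fisher_ratio_sketch(df, y)
print(fdr)
```

A larger ratio means at least one feature separates the two classes well; a value near zero means no single feature does.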
- get_total_correlation(df)
Calculate Total Correlation
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
Total correlation
- Return type:
float
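Total correlation measures how much information the features share: it is the sum of the marginal entropies minus the joint entropy. The estimator used by qbiocode is not documented here, so the sketch below swaps in the closed form for jointly Gaussian data, TC = -0.5 * log(det(R)), where R is the correlation matrix. The function name is hypothetical.

```python
import numpy as np
import pandas as pd

def gaussian_total_correlation(df: pd.DataFrame) -> float:
    """Total correlation in nats under a Gaussian assumption; illustrative only."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    return float(-0.5 * logdet)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

independent = pd.DataFrame({"a": a, "b": b})
dependent = pd.DataFrame({"a": a, "a_noisy": a + 0.1 * rng.normal(size=500)})

tc_indep = gaussian_total_correlation(independent)  # close to zero
tc_dep = gaussian_total_correlation(dependent)      # clearly positive
print(tc_indep, tc_dep)
```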
- get_mutual_information(df, y)
Calculate mutual information via sklearn
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
y (int) – supervised binary class label
- Returns:
Mutual information
- Return type:
float
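The entry above says mutual information is computed via sklearn but does not say which estimator or how per-feature values are reduced to one float. A minimal sketch, assuming `sklearn.feature_selection.mutual_info_classif` and a plain mean over features:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
df = pd.DataFrame({
    "informative": y + 0.1 * rng.normal(size=200),  # almost determines the label
    "noise": rng.normal(size=200),                  # unrelated to the label
})

# One estimate per feature, in nats; random_state makes the estimate repeatable.
mi = mutual_info_classif(df, y, random_state=0)
mean_mi = float(mi.mean())  # the reduction to a single number is an assumption
print(dict(zip(df.columns, mi)), mean_mi)
```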
- get_variance(df)
Get variance
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
- (avg_var, std_var)
avg_var (float): Mean variance
std_var (float): Standard deviation of variance
- Return type:
tuple
- get_coefficient_var(df)
Get coefficient of variation
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
- (avg_co_of_v, std_var)
avg_co_of_v (float): Mean coefficient of variation
std_var (float): Standard deviation of coefficient of variation
- Return type:
tuple
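The coefficient of variation of a feature is its standard deviation divided by its mean, which makes it independent of the feature's scale. The sketch below computes it per feature with pandas and then summarizes across features. The use of the sample standard deviation (`ddof=1`) is an assumption, not something the documentation states.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],  # same shape as "a", scaled by 2
})

# Per-feature coefficient of variation: std / mean.
cv = df.std() / df.mean()

avg_cv = float(cv.mean())  # 0.5 for both features
std_cv = float(cv.std())   # 0.0, because scaling does not change the ratio
print(avg_cv, std_cv)
```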
- get_nnz(df)
Calculate nonzero values in the data
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
nonzero count
- Return type:
int
- get_low_var_features(df, num_features)
Get the count of low-variance features
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
num_features (int) – number of features in the dataset
- Raises:
ValueError – If no feature is strong enough to keep
- Returns:
count of features with low variance
- Return type:
int
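The variance threshold used by `get_low_var_features` is not documented here. A minimal sketch of the idea, with a hypothetical helper and a hypothetical `threshold` argument, counts the columns whose variance falls at or below a cutoff:

```python
import pandas as pd

def count_low_variance(df: pd.DataFrame, threshold: float) -> int:
    """Count features whose population variance is <= threshold; illustrative only."""
    variances = df.var(ddof=0)
    return int((variances <= threshold).sum())

df = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],       # variance 0
    "nearly_constant": [1.0, 1.0, 1.0, 1.1],
    "spread": [0.0, 5.0, 10.0, 15.0],
})

low_count = count_low_variance(df, threshold=0.01)
print(low_count)
```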
- get_log_density(df)
Calculate the mean log density of the data
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
mean log kernel density
- Return type:
float
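One way to obtain a mean log kernel density is to fit a kernel density estimate and average its log-density over the samples. The sketch below assumes `sklearn.neighbors.KernelDensity` with a Gaussian kernel; the kernel and the bandwidth are illustrative choices, and the library's settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["a", "b"])

X = df.to_numpy()
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns the log-density evaluated at each row.
mean_log_density = float(kde.score_samples(X).mean())
print(mean_log_density)
```

Values closer to zero indicate samples packed into a denser region; more negative values indicate sparser data.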
- get_fractal_dim(df, k_max)
Calculate the fractal dimension of the data using Higuchi’s method
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
k_max (int) – Maximum number of k values to use in the calculation
- Returns:
Fractal dimension of the data
- Return type:
float
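Higuchi's method estimates the fractal dimension of a one-dimensional series as the slope of log curve length against log(1/k) for step sizes k = 1..k_max. The sketch below implements that for a single series; how qbiocode applies it to a multi-column DataFrame is not documented here, and the function name `higuchi_fd` is hypothetical.

```python
import numpy as np

def higuchi_fd(x, k_max: int) -> float:
    """Higuchi fractal dimension of a 1-D series; illustrative only."""
    x = np.asarray(x, dtype=float)
    n = x.size
    log_inv_k, log_len = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):                      # k sub-series, one per offset m
            n_steps = (n - 1 - m) // k
            if n_steps < 1:
                continue
            idx = m + k * np.arange(n_steps + 1)
            dist = np.abs(np.diff(x[idx])).sum()
            # Normalize for the number of steps, then scale by 1/k.
            lengths.append(dist * (n - 1) / (n_steps * k) / k)
        log_inv_k.append(np.log(1.0 / k))
        log_len.append(np.log(np.mean(lengths)))
    slope, _ = np.polyfit(log_inv_k, log_len, 1)
    return float(slope)

line = np.arange(200, dtype=float)                 # smooth curve: dimension 1
noise = np.random.default_rng(0).normal(size=1000)  # white noise: dimension near 2

fd_line = higuchi_fd(line, k_max=8)
fd_noise = higuchi_fd(noise, k_max=8)
print(fd_line, fd_noise)
```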
- get_moments(df)
Compute third and fourth order moments of the data
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
- (avg_skew, std_skew, avg_kurt, std_kurt)
avg_skew (float): Mean skewness
std_skew (float): Standard deviation of skewness
avg_kurt (float): Mean kurtosis
std_kurt (float): Standard deviation of kurtosis
- Return type:
tuple
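Skewness and kurtosis are the standardized third and fourth moments. As a sketch, the per-feature values can be computed with pandas and then summarized across features. pandas uses bias-corrected estimators and reports excess kurtosis, which may differ from what qbiocode uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "symmetric": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "right_tail": [0.0, 0.0, 0.0, 0.0, 10.0],
})

skews = df.skew()  # third standardized moment, per feature
kurts = df.kurt()  # excess kurtosis, per feature

avg_skew, std_skew = float(skews.mean()), float(skews.std())
avg_kurt, std_kurt = float(kurts.mean()), float(kurts.std())
print(avg_skew, std_skew, avg_kurt, std_kurt)
```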
- get_entropy(y)
Calculate entropy of the target variable
- Parameters:
y (int) – supervised binary class label
- Returns:
- (avg_y_entropy, std_y_entropy)
avg_y_entropy (float): Mean entropy
std_y_entropy (float): Standard deviation of entropy
- Return type:
tuple
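For orientation, the Shannon entropy of a label vector depends only on the class proportions. The sketch below computes a single entropy value in bits; how qbiocode arrives at a mean and a standard deviation of entropy is not documented here, so this shows the underlying quantity only. The function name is hypothetical.

```python
import numpy as np

def label_entropy_bits(y) -> float:
    """Shannon entropy of the class distribution, in bits; illustrative only."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

balanced = label_entropy_bits([0, 1, 0, 1])    # two equally likely classes: 1 bit
imbalanced = label_entropy_bits([0, 0, 0, 1])  # less uncertainty than the balanced case
print(balanced, imbalanced)
```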
- get_volume(df)
Get volume of the data from Convex Hull
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
- Returns:
volume (float): Volume of the space spanned by the features of the data
- Return type:
float
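A convex hull volume can be computed with `scipy.spatial.ConvexHull`; whether qbiocode uses SciPy for this, or first reduces the dimensionality, is an assumption. The sketch uses a unit square so the result is easy to check by hand.

```python
import pandas as pd
from scipy.spatial import ConvexHull

# Four corners of the unit square plus one interior point.
df = pd.DataFrame({
    "x": [0.0, 1.0, 1.0, 0.0, 0.5],
    "y": [0.0, 0.0, 1.0, 1.0, 0.5],
})

hull = ConvexHull(df.to_numpy())
volume = float(hull.volume)  # in 2-D, `volume` is the enclosed area
print(volume)
```

Note that the hull needs more samples than features and becomes expensive as the number of features grows.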
- get_complexity(df, n_neighbors=10, n_components=2)
Measure the manifold complexity by fitting Isomap and analyzing the geodesic vs. Euclidean distances. This function computes the reconstruction error of the Isomap algorithm, which serves as an indicator of the complexity of the manifold represented by the data.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
n_neighbors (int) – Number of neighbors for the Isomap algorithm. Default: 10
n_components (int) – Number of components (dimensions) for the Isomap projection. Default: 2
- Returns:
reconstruction_error (float): The reconstruction error of the Isomap model, i.e. the residual error of geodesic distances, which indicates the complexity of the manifold.
- Return type:
float
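The description above maps naturally onto scikit-learn's `Isomap` and its `reconstruction_error()` method. The sketch below assumes that is the underlying call and uses synthetic points on a curved sheet; the data and column names are made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import Isomap

# Points on a half-cylinder: a 2-D sheet curved through 3-D space.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, np.pi, size=150)
h = rng.uniform(0.0, 1.0, size=150)
df = pd.DataFrame({"x": np.cos(t), "y": np.sin(t), "z": h})

iso = Isomap(n_neighbors=10, n_components=2).fit(df.to_numpy())

# How poorly the 2-D embedding preserves the geodesic distance structure.
reconstruction_error = float(iso.reconstruction_error())
print(reconstruction_error)
```

A lower error suggests the data lie close to a manifold of `n_components` dimensions; a higher error suggests a more complex structure.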
- evaluate(df, y, file)
Evaluate a dataset and return a transposed summary DataFrame of statistical measures derived from it. Using the functions defined above, it computes intrinsic dimension, condition number, Fisher Discriminant Ratio, total correlation, mutual information, variance, coefficient of variation, data sparsity, low variance features, data density, fractal dimension, data distributions (skewness and kurtosis), entropy of the target variable, and manifold complexity. The summary contains the dataset name, number of features, number of samples, feature-to-sample ratio, and the statistical measures, and is transposed for readability. This is useful for quickly summarizing the characteristics of a dataset so that its properties can be correlated with its performance in predictive modeling tasks.
- Parameters:
df (pandas.DataFrame) – Dataset in pandas with observation in rows, features in columns
y (int) – supervised binary class label
file (str) – Name of the dataset file for identification in the summary DataFrame
- Returns:
transposed (pandas.DataFrame): Summary DataFrame containing various statistical measures of the dataset
- Return type:
pandas.DataFrame
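To show the shape of the result, the sketch below builds a one-row summary from a few simple measures and transposes it so that each measure becomes a row. The function name, the row labels, and the choice of measures are all hypothetical; the real `evaluate` computes many more and may label them differently.

```python
import numpy as np
import pandas as pd

def summarize_sketch(df: pd.DataFrame, y, file: str) -> pd.DataFrame:
    """Build a transposed one-dataset summary; illustrative only."""
    n_samples, n_features = df.shape
    row = {
        "dataset": file,
        "n_features": n_features,
        "n_samples": n_samples,
        "feature_sample_ratio": n_features / n_samples,
        "n_classes": int(np.unique(np.asarray(y)).size),
        "mean_variance": float(df.var().mean()),
    }
    summary = pd.DataFrame([row])  # one row per dataset
    return summary.T               # measures as rows, for readability

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
y = [0, 1, 0, 1]
transposed = summarize_sketch(df, y, file="toy.csv")
print(transposed)
```

Summaries for several datasets can then be concatenated column-wise and compared against model performance.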