Dataset Generator

HestiaGenerator(data, verbose=True)

Class for generating multiple Dataset partitions for generalisation evaluation.

Initialise class


Name Type Description Default
data DataFrame

DataFrame with the original data from which datasets will be generated.


calculate_augood(results, target_df, target_field_name, target_embds=None, return_weights=False)

Calculate the 'area under the GOOD curve' (AU-GOOD) metric.

This function calculates an AU-GOOD score by weighting a performance metric with the similarity values obtained by comparing the target deployment distribution with the training distribution. It returns both the weighted GOOD curve values and the AU-GOOD score.


Name Type Description Default
results Dict[float, float]

A dictionary where keys are bins or thresholds (float) and values are metrics or counts associated with each bin.

target_df DataFrame

A DataFrame containing the target data for similarity comparison. The column specified by target_field_name will be used to populate the similarity arguments for comparison.

target_field_name Optional[str]

Name of the field in target_df that contains target values for comparison.

target_embds Optional[ndarray]

A NumPy array containing the target embeddings for similarity calculation.

return_weights bool

Return histogram values for train-deployment similarities



Type Description
Union[Tuple[np.ndarray, float], Tuple[np.ndarray, float, np.ndarray]]

A tuple containing:
- good_curve (np.ndarray): Array of weighted values representing the GOOD curve.
- au_good (float): The calculated area under the GOOD curve.

and, optionally:
- weights (np.ndarray): Array of weights representing train-deployment similarities.
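The weighting described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `weighted_good_curve`, the `max_similarities` input (one maximum train-similarity value per deployment item), and the binning rule are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def weighted_good_curve(
    results: Dict[float, float],
    max_similarities: Sequence[float],
) -> Tuple[List[float], float, List[float]]:
    """Weight per-threshold performance by a deployment similarity histogram."""
    if not max_similarities:
        raise ValueError("max_similarities must not be empty")
    thresholds = sorted(results)
    counts = [0] * len(thresholds)
    for sim in max_similarities:
        # Assumption: each value falls in the bin of the smallest threshold
        # that is >= sim; values above the last threshold go to the last bin.
        idx = min(bisect_left(thresholds, sim), len(thresholds) - 1)
        counts[idx] += 1
    # Normalise the histogram so the weights sum to 1.
    weights = [count / len(max_similarities) for count in counts]
    good_curve = [results[t] * w for t, w in zip(thresholds, weights)]
    return good_curve, sum(good_curve), weights


# Hypothetical per-threshold performance and deployment similarities.
results = {0.2: 0.60, 0.4: 0.70, 0.6: 0.80, 0.8: 0.90}
max_similarities = [0.15, 0.35, 0.38, 0.55]
good_curve, au_good, weights = weighted_good_curve(results, max_similarities)
```

Thresholds that no deployment item falls into receive a weight of zero, so model performance at those thresholds does not contribute to the score.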

calculate_partitions(sim_args=None, sim_df=None, label_name=None, min_threshold=0.0, threshold_step=0.05, test_size=0.2, valid_size=0.1, partition_algorithm='ccpart', random_state=42, verbose=1, n_partitions=None)

Calculates multiple partitions of a dataset for training, validation, and testing based on sequence similarity. Supports the ccpart, graph_part, and butina partitioning algorithms. Additionally, it computes partitions for different similarity thresholds and random partitions.


Example of partitioning from a minimum similarity threshold of 0.2, in steps of 0.05, with a test size of 0.2:

partitions = calculate_partitions(
    sim_args=similarity_args,
    label_name='Y',
    min_threshold=0.2,
    threshold_step=0.05,
    test_size=0.2,
    partition_algorithm='ccpart',
    random_state=42
)

Accessing the partitions for a specific threshold (here 0.3):

train_set = partitions[0.3]['train']
valid_set = partitions[0.3]['valid']
test_set = partitions[0.3]['test']


Name Type Description Default
sim_args Optional[SimArguments]

Object containing the similarity parameters for partitioning. This includes options for calculating sequence similarity, such as the alignment method and similarity threshold. Defaults to None.

sim_df Optional[DataFrame]

Precomputed similarity DataFrame. If None, the similarity will be calculated using sim_args.

label_name Optional[str]

The name of the label column for the dataset. Defaults to None.

min_threshold Optional[float]

The minimum similarity threshold to start partitioning. Defaults to 0.0.

threshold_step Optional[float]

The step size for varying the similarity threshold during partitioning. Defaults to 0.05.

test_size Optional[float]

The proportion of the dataset to allocate to the test set. Defaults to 0.2.

valid_size Optional[float]

The proportion of the training set to allocate to the validation set. Defaults to 0.1.

verbose int

Verbosity level for process logging, where higher values increase output detail.

partition_algorithm Optional[str]

The partitioning algorithm to use. Options are:
- 'ccpart': Connected components algorithm that places the smallest unconnected clusters in the test set.
- 'graph_part': GraphPart partitioning.
- 'butina': Butina split - Connected components algorithm that places random clusters in the test set.

Defaults to 'ccpart'.

random_state Optional[int]

The random seed for reproducibility. Defaults to 42.

n_partitions Optional[int]

The number of partitions to create when using graph_part. Defaults to None.



Type Description

A dictionary containing the partitions for each threshold. The dictionary has keys:
- train: DataFrame for the training set.
- valid: DataFrame for the validation set.
- test: DataFrame for the test set.
- clusters: The clusters formed by the partitioning algorithm.

For random partitions, the key 'random' will contain the train, valid, and test sets.


Type Description

If an unsupported partition algorithm is specified.
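The connected-components idea behind 'ccpart' can be illustrated with a small, self-contained sketch. It is not the library's code: the helper names, the `(index_a, index_b, similarity)` pair format, and the tie-breaking rule are assumptions chosen for the example, and the validation split is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Pair = Tuple[int, int, float]


def connected_components(n_items: int, edges: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Group item indexes into connected components with union-find."""
    parent = list(range(n_items))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)

    groups: Dict[int, Set[int]] = defaultdict(set)
    for i in range(n_items):
        groups[find(i)].add(i)
    return list(groups.values())


def smallest_clusters_split(
    n_items: int, pairs: Sequence[Pair], threshold: float, test_size: float = 0.2
) -> Dict[str, List[int]]:
    """Fill the test set with the smallest clusters until test_size is reached."""
    # Items are linked only when their similarity exceeds the threshold.
    edges = [(a, b) for a, b, sim in pairs if sim > threshold]
    clusters = sorted(connected_components(n_items, edges), key=lambda c: (len(c), min(c)))
    target = int(n_items * test_size)
    test: Set[int] = set()
    for cluster in clusters:
        if len(test) + len(cluster) > target:
            break  # clusters are sorted by size, so no later cluster fits either
        test |= cluster
    train = sorted(set(range(n_items)) - test)
    return {"train": train, "test": sorted(test)}


# Ten hypothetical items with a few similar pairs; items 7-9 have no neighbours.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.5), (5, 6, 0.35)]
partitions = {t: smallest_clusters_split(10, pairs, t) for t in (0.3, 0.6)}
```

Because whole clusters are moved together, no pair with similarity above the threshold ends up split between the training and test sets.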


Calculate pairwise similarity between all the elements in the dataset.


Name Type Description Default
sim_args SimArguments

See similarity arguments entry.
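As a rough illustration of what a pairwise similarity calculation produces, the sketch below builds a long-format table of item pairs and their similarity. Everything in it is an assumption for the example: the `query`, `target`, and `metric` field names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for a real sequence alignment.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence


def pairwise_similarity(sequences: Sequence[str], min_threshold: float = 0.0) -> List[dict]:
    """Return one row per pair of items whose similarity reaches min_threshold."""
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        # Stand-in similarity: difflib ratio in [0, 1], not an alignment score.
        metric = SequenceMatcher(None, sequences[i], sequences[j]).ratio()
        if metric >= min_threshold:
            rows.append({"query": i, "target": j, "metric": metric})
    return rows


# Three hypothetical sequences: the first two differ in one residue.
sequences = ["MKTAYIAK", "MKTAYIAR", "GGSGGSGG"]
similarities = pairwise_similarity(sequences, min_threshold=0.3)
```

Dropping pairs below the minimum threshold keeps the table small, since most pairs in a large dataset are dissimilar.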


compare_models(model_results, statistical_test='wilcoxon') staticmethod

Compare the generalisation capabilities of n models against each other, providing a p-value for every possible pair of models that measures how likely model A is to outperform model B.


Name Type Description Default
model_results Dict[str, Union[List[float], ndarray]]

Dictionary with model name as key and a list with the ordered performance values of the model at different thresholds.

statistical_test str

Statistical test to compute the model differences. Currently supported:
- wilcoxon: Wilcoxon ranked-sum test

Defaults to 'wilcoxon'.
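A pairwise comparison of this kind can be sketched without external dependencies. The example below assumes a paired, one-sided Wilcoxon signed-rank test, since each model is evaluated at the same thresholds. That variant is an assumption and may differ from the test the library runs. The function names are hypothetical, and the exact enumeration is only practical for a small number of thresholds.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def signed_rank_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact one-sided p-value that paired values in `a` tend to exceed `b`."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    if not diffs:
        return 1.0
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely,
    # so count the patterns whose positive-rank sum is at least the observed one.
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= observed
    )
    return hits / 2 ** len(ranks)


def compare_all(model_results: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """p-value for every ordered pair (A, B): is A better performing than B?"""
    names = list(model_results)
    return {
        (m1, m2): signed_rank_pvalue(model_results[m1], model_results[m2])
        for m1 in names
        for m2 in names
        if m1 != m2
    }


# Hypothetical performance of two models at five similarity thresholds.
model_results = {
    "A": [0.90, 0.85, 0.80, 0.74, 0.70],
    "B": [0.88, 0.80, 0.72, 0.63, 0.55],
}
pvalues = compare_all(model_results)
```

A small p-value for the pair (A, B) suggests that model A performs better than model B across thresholds; the pair (B, A) is tested separately because the test is one-sided.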



Load partition indexes if they have already been calculated.


Name Type Description Default
data_path str

Path to saved partition indexes.



Load similarity calculation from file.


Name Type Description Default
output_path str

File with similarity calculations.


save_precalculated(output_path, include_metada=True)

Save partition indexes to disk for quicker re-running.


Name Type Description Default
output_path str

Path where partition indexes should be saved.
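The idea of caching partition indexes can be illustrated with a short round trip. The on-disk format used by the library is not described here, so the gzip-compressed JSON layout and the `save_indexes` and `load_indexes` helpers below are purely hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, List

Partitions = Dict[float, Dict[str, List[int]]]


def save_indexes(partitions: Partitions, output_path: str) -> None:
    """Write partition indexes as gzip-compressed JSON."""
    # JSON object keys must be strings, so thresholds are stored as text.
    payload = {str(threshold): split for threshold, split in partitions.items()}
    with gzip.open(output_path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)


def load_indexes(data_path: str) -> Partitions:
    """Read partition indexes back, restoring the float thresholds."""
    with gzip.open(data_path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)
    return {float(threshold): split for threshold, split in payload.items()}


# Hypothetical indexes for a single threshold.
partitions = {0.3: {"train": [0, 1, 2], "valid": [3], "test": [4, 5]}}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "partitions.json.gz")
    save_indexes(partitions, path)
    restored = load_indexes(path)
```

Storing row indexes rather than full DataFrames keeps the file small, and the partitions can be rebuilt by indexing the original data. This sketch handles numeric thresholds only; a non-numeric key such as 'random' would need separate handling.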


SimArguments(data_type='protein', field_name='sequence', min_threshold=0.0, threads=cpu_count(), verbose=0, save_alignment=False, filename='alignment', sim_function=None, bits=None, radius=None, fingerprint=None, denominator=None, representation=None, prefilter=None, alignment_algorithm=None, query_embds=None, target_embds=None, target_df=None, needle_config=None, **kwargs)

Dataclass with the inputs for similarity calculation.