
Clustering

generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True)

Generates clusters from a DataFrame of entities and their precomputed pairwise similarities.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with entities to cluster.

required
field_name str

Name of the field with the entity information (e.g., protein_sequence or structure_path).

required
threshold float

Similarity value above which entities are considered similar. Defaults to 0.4.

0.4
sim_df DataFrame

DataFrame with the similarities (metric) between query and target entities, as produced by the calculate_similarity function.

required
verbose int

How much information will be displayed. Options:
- 0: Errors
- 1: Warnings
- 2: All
Defaults to 0.

0
cluster_algorithm str

Clustering algorithm to use. Options:
- CDHIT or greedy_incremental
- greedy_cover_set
- connected_components
Defaults to 'greedy_incremental'.

'greedy_incremental'
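
To illustrate how threshold and sim_df interact, here is a minimal sketch of the idea behind the connected_components option. It is not the library's implementation. It assumes that sim_df has columns named query, target and metric, that query and target hold integer row positions in df, and that "similar" means metric strictly greater than threshold. The function name connected_components_sketch is hypothetical.

```python
import pandas as pd


def connected_components_sketch(n_entities, sim_df, threshold=0.4):
    """Label entities so that any pair with metric > threshold shares a label.

    Assumed layout (not confirmed by the documentation): sim_df has the
    columns 'query', 'target' and 'metric', with integer entity positions.
    """
    parent = list(range(n_entities))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Keep only the pairs above the threshold, then merge their components.
    similar = sim_df[sim_df["metric"] > threshold]
    for q, t in zip(similar["query"], similar["target"]):
        root_q, root_t = find(int(q)), find(int(t))
        if root_q != root_t:
            # The smaller index becomes the representative of the merged set.
            parent[max(root_q, root_t)] = min(root_q, root_t)

    return [find(i) for i in range(n_entities)]


# Toy data: five entities and four scored pairs.
df = pd.DataFrame({"sequence": ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]})
sim_df = pd.DataFrame({
    "query": [0, 1, 2, 0],
    "target": [1, 2, 3, 4],
    "metric": [0.9, 0.1, 0.8, 0.3],
})
df["cluster"] = connected_components_sketch(len(df), sim_df, threshold=0.4)
print(df)
```

With the toy data, only the pairs (0, 1) and (2, 3) exceed the threshold, so entities 0 and 1 share a label, 2 and 3 share another, and 4 stays on its own. Lowering the threshold merges clusters because more pairs count as similar.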

Returns:

Type Description
pd.DataFrame

DataFrame with entities and the cluster they belong to.
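
To make the shape of such a result concrete, the following sketch assigns clusters in the greedy incremental (CD-HIT-like) manner: entities are visited from longest to shortest, and each one either joins the first existing representative it is similar to or becomes a new representative. This is an illustration under assumptions, not the library's code. The sim_df column names (query, target, metric), the output column name cluster, the use of row positions as entity identifiers, and the function name greedy_incremental_sketch are all hypothetical.

```python
import pandas as pd


def greedy_incremental_sketch(df, field_name, sim_df, threshold=0.4):
    """Greedy incremental clustering over precomputed similarities.

    Assumes df has a default 0..n-1 index and that sim_df refers to
    entities by that index in its 'query' and 'target' columns.
    """
    # Symmetric look-up of the pairs whose metric exceeds the threshold.
    similar = sim_df[sim_df["metric"] > threshold]
    neighbours = {}
    for q, t in zip(similar["query"], similar["target"]):
        neighbours.setdefault(int(q), set()).add(int(t))
        neighbours.setdefault(int(t), set()).add(int(q))

    # Longest entities first; sorted() is stable, so ties keep input order.
    order = sorted(df.index, key=lambda i: -len(df.at[i, field_name]))

    representatives = []
    cluster_of = {}
    for idx in order:
        close = neighbours.get(idx, set())
        # Join the first representative this entity is similar to, if any.
        hit = next((r for r in representatives if r in close), None)
        if hit is None:
            representatives.append(idx)
            cluster_of[idx] = idx
        else:
            cluster_of[idx] = hit

    return pd.DataFrame({
        field_name: df[field_name],
        "cluster": [cluster_of[i] for i in df.index],
    })


entities = pd.DataFrame({"sequence": ["AAAAAA", "AAAAT", "AAT", "CCCC"]})
pairs = pd.DataFrame({
    "query": [0, 1, 0, 2],
    "target": [1, 2, 2, 3],
    "metric": [0.9, 0.8, 0.2, 0.1],
})
clusters = greedy_incremental_sketch(entities, "sequence", pairs, threshold=0.4)
print(clusters)
```

In this toy run, entity 2 is similar to entity 1 but not to any representative, so it starts its own cluster. A connected-components approach would instead place entities 0, 1 and 2 together, which is one way the choice of cluster_algorithm can change the result.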

Raises:

Type Description
NotImplementedError

The requested clustering algorithm is not supported.