Clustering

`generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True)`

Generates clusters from a DataFrame.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with entities to cluster.	required
`field_name`	`str`	Name of the field holding the entity information (e.g., `protein_sequence` or `structure_path`).	required
`threshold`	`float`	Similarity value above which two entities are considered similar.	`0.4`
`sim_df`	`DataFrame`	DataFrame with the similarity (`metric`) between each `query` and `target` pair, as produced by the `calculate_similarity` function.	required
`verbose`	`int`	How much information is displayed. Options: `0` (errors), `1` (warnings), `2` (all).	`0`
`cluster_algorithm`	`str`	Clustering algorithm to use. Options: `CDHIT` or `greedy_incremental`, `greedy_cover_set`, `connected_components`.	`'greedy_incremental'`

Returns:

Type	Description
`pd.DataFrame`	DataFrame with entities and the cluster they belong to.

Raises:

Type	Description
`NotImplementedError`	Raised if the requested clustering algorithm is not supported.
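
To make the roles of `threshold`, `sim_df` and `cluster_algorithm` more concrete, here is a minimal, dependency-free sketch of what a `connected_components`-style clustering could look like. It is not the library's implementation: the helper names `connected_components_sketch` and `cluster_sketch` are hypothetical, the similarity table is represented as plain `(query, target, metric)` tuples instead of a DataFrame, and labelling each cluster by its first member is an assumption made for illustration.

```python
from collections import defaultdict


def connected_components_sketch(entities, sim_pairs, threshold=0.4):
    """Map each entity index to a cluster label (hypothetical helper).

    entities: list of entity indices, standing in for the rows of `df`.
    sim_pairs: iterable of (query, target, metric) tuples, mirroring the
        `query`, `target` and `metric` columns described for `sim_df`.
    """
    # Keep only the pairs whose similarity is above the threshold.
    neighbours = defaultdict(set)
    for query, target, metric in sim_pairs:
        if metric > threshold and query != target:
            neighbours[query].add(target)
            neighbours[target].add(query)

    clusters = {}
    for start in entities:
        if start in clusters:
            continue
        # Depth-first traversal; the first entity reached names the cluster
        # (an assumption of this sketch, not documented behaviour).
        clusters[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other not in clusters:
                    clusters[other] = start
                    stack.append(other)
    return clusters


def cluster_sketch(entities, sim_pairs, threshold=0.4,
                   cluster_algorithm="connected_components"):
    # Only one algorithm is sketched here; any other name is rejected,
    # mirroring the documented NotImplementedError.
    if cluster_algorithm == "connected_components":
        return connected_components_sketch(entities, sim_pairs, threshold)
    raise NotImplementedError(
        f"Clustering algorithm {cluster_algorithm!r} is not supported"
    )


pairs = [(0, 1, 0.9), (1, 2, 0.5), (2, 3, 0.1)]
labels = cluster_sketch([0, 1, 2, 3], pairs, threshold=0.4)
print(labels)  # {0: 0, 1: 0, 2: 0, 3: 3}
```

In this toy input, entities 0, 1 and 2 are linked through pairs above the threshold and end up in one cluster, while entity 3 stays alone because its only similarity (0.1) falls below 0.4.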