Clustering

`generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True, **kwargs)`

Groups the entities in `df` into clusters using the precomputed pairwise similarities in `sim_df`.

This function supports several clustering algorithms that operate on pairwise similarity data. Each algorithm has different scalability, behavior, and underlying assumptions. Below is a summary of the available algorithms:

Clustering algorithms:

- `CDHIT` or `greedy_incremental`:
    Greedy incremental clustering similar to CD-HIT. Entities are sorted
    by length, and each new element seeds a cluster; all items above the
    similarity threshold are assigned to the same cluster. Fast,
    deterministic, and suitable for sequence-length-dependent ordering.

- `greedy_cover_set` or `butina`:
    A greedy set-cover–style approach (similar to Butina clustering).
    Selects items with the largest number of neighbors above the
    threshold and forms clusters around them. Tends to produce compact,
    high-similarity groups.

- `connected_components`:
    Treats similarity relations above the threshold as graph edges and
    computes connected components. All entities connected (directly or
    transitively) via similarity ≥ threshold belong to the same
    cluster. Very fast and stable for large sparse similarity graphs.

- `bitbirch`:
    Clustering based on the BitBirch tree/hashing algorithm. Supports
    two modes:
        (1) fingerprint-based (e.g. SMILES → Morgan fingerprints), or
        (2) similarity-matrix-derived.
    Scales efficiently to large datasets and creates hierarchical,
    radius-based clusters.

- `umap`:
    Reduces high-dimensional fingerprints or similarity matrices into a
    low-dimensional manifold using UMAP, then applies agglomerative
    clustering. Useful when clusters are better separated in embedded
    space than in raw feature or similarity space.
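
The difference between the greedy and the graph-based strategies above is easiest to see on a toy input. The following is a minimal pure-Python sketch, not the library's implementation: the helper names, the integer item indices, and the longest-first visiting order (borrowed from CD-HIT) are assumptions made for illustration, and `pairs` is assumed to contain only pairs that already pass the similarity threshold.

```python
from collections import defaultdict


def connected_components(n, pairs):
    """Label n items so that items linked by any chain of pairs share a label."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smallest index as the representative of the component.
            parent[max(ra, rb)] = min(ra, rb)
    return [find(i) for i in range(n)]


def greedy_incremental(lengths, pairs):
    """Visit items longest-first (assumed order); each unassigned item seeds
    a cluster and absorbs its direct neighbours that are still unassigned."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = [None] * len(lengths)
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for seed in order:
        if labels[seed] is not None:
            continue
        labels[seed] = seed
        for j in neighbours[seed]:
            if labels[j] is None:
                labels[j] = seed
    return labels


# Toy chain 0-1-2 plus an isolated item 3 (hypothetical data).
pairs = [(0, 1), (1, 2)]
lengths = [10, 9, 8, 7]
cc_labels = connected_components(4, pairs)
gi_labels = greedy_incremental(lengths, pairs)
```

On this chain, the connected-components pass places items 0, 1, and 2 in one cluster because they are linked transitively, whereas the greedy pass leaves item 2 in its own cluster because it is not a direct neighbour of the seed 0.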

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with entities to cluster.	required
`field_name`	`str`	Name of the field with the entity information (e.g., `protein_sequence` or `structure_path`).	required
`threshold`	`float`	Similarity value above which entities are considered similar.	`0.4`
`sim_df`	`DataFrame`	DataFrame with the similarities (`metric`) between `query` and `target`, as produced by the `calculate_similarity` function.	required
`verbose`	`int`	How much information is displayed. Options: 0 (errors), 1 (warnings), 2 (all).	`0`
`cluster_algorithm`	`str`	Clustering algorithm to use. Options: `CDHIT` or `greedy_incremental`, `greedy_cover_set` or `butina`, `connected_components`, `bitbirch`, `umap`.	`'greedy_incremental'`
`filter_smaller`	`Optional[bool]`	Whether to filter smaller indices when constructing adjacency matrices in similarity-based algorithms.	`True`
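
To make the expected inputs concrete, here is a small sketch of what `df` and `sim_df` might look like and how `threshold` selects the similar pairs. The column names `query`, `target`, and `metric` come from the table above; the toy values, the use of integer row positions of `df` as `query`/`target` identifiers, and the strict `>` comparison are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical entities to cluster; "sequence" plays the role of `field_name`.
df = pd.DataFrame({"sequence": ["MKTAYIAK", "MKTAYIAR", "GGSGGS", "GGSGGT"]})

# Hypothetical pairwise similarities in the `query` / `target` / `metric` layout.
sim_df = pd.DataFrame({
    "query":  [0, 0, 1, 2],
    "target": [1, 2, 2, 3],
    "metric": [0.90, 0.10, 0.15, 0.80],
})

threshold = 0.4

# Keep only the pairs whose similarity is above the threshold.
similar = sim_df[sim_df["metric"] > threshold]
pairs = list(zip(similar["query"].tolist(), similar["target"].tolist()))
```

With these values, only the pairs (0, 1) and (2, 3) survive the filter, so any of the similarity-based algorithms would be expected to group the four entities into two clusters.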

Returns:

Type	Description
`np.ndarray`	DataFrame with entities and the cluster they belong to.

Raises:

Type	Description
`NotImplementedError`	Raised when `cluster_algorithm` is not one of the supported options.
