Clustering

generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True, **kwargs)

Generates clusters from a DataFrame.

This function supports several clustering algorithms that operate on pairwise similarity data. Each algorithm has different scalability, behavior, and underlying assumptions. Below is a summary of the available algorithms:

Clustering algorithms:

- `CDHIT` or `greedy_incremental`:
    Greedy incremental clustering similar to CD-HIT. Entities are sorted
    by length, and each entity that is not yet assigned seeds a new
    cluster; all entities whose similarity to the seed is above the
    threshold are assigned to that cluster. Fast, deterministic, and
    suitable when ordering by sequence length is meaningful.

- `greedy_cover_set` or `butina`:
    A greedy set-cover–style approach (similar to Butina clustering).
    Selects items with the largest number of neighbors above the
    threshold and forms clusters around them. Tends to produce compact,
    high-similarity groups.

- `connected_components`:
    Treats similarity relations above the threshold as graph edges and
    computes connected components. All entities connected (directly or
    transitively) via similarity above the threshold belong to the same
    cluster. Very fast and stable for large sparse similarity graphs.

- `bitbirch`:
    Clustering based on the BitBirch tree/hashing algorithm. Supports
    two modes:
        (1) fingerprint-based (e.g. SMILES converted to Morgan fingerprints), or
        (2) similarity-matrix-derived.
    Scales efficiently to large datasets and creates hierarchical,
    radius-based clusters.

- `umap`:
    Reduces high-dimensional fingerprints or similarity matrices into a
    low-dimensional manifold using UMAP, then applies agglomerative
    clustering. Useful when clusters are better separated in embedded
    space than in raw feature or similarity space.
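
To make the difference between the three similarity-graph algorithms above concrete, here is a minimal pure-Python sketch. It is not this library's implementation: the helper names (`build_neighbors`, `greedy_incremental`, `greedy_cover_set`, `connected_components`), the `(i, j, similarity)` tuple format, and the use of the seed index as cluster label are all assumptions made for illustration.

```python
def build_neighbors(pairs, n, threshold):
    """Map each entity index to the indices whose similarity is above the threshold."""
    neighbors = {i: set() for i in range(n)}
    for i, j, sim in pairs:
        if i != j and sim > threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def greedy_incremental(lengths, neighbors):
    """CD-HIT-like sketch: visit entities longest first; unassigned ones seed clusters."""
    labels = {}
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for seed in order:
        if seed in labels:
            continue
        labels[seed] = seed
        for other in neighbors[seed]:
            labels.setdefault(other, seed)  # keep an earlier assignment if present
    return [labels[i] for i in range(len(lengths))]


def greedy_cover_set(n, neighbors):
    """Butina-like sketch: repeatedly pick the entity with most unassigned neighbors."""
    labels = {}
    unassigned = set(range(n))
    while unassigned:
        # Ties are broken by the lowest index because max() keeps the first maximum.
        seed = max(sorted(unassigned), key=lambda i: len(neighbors[i] & unassigned))
        members = (neighbors[seed] & unassigned) | {seed}
        for member in members:
            labels[member] = seed
        unassigned -= members
    return [labels[i] for i in range(n)]


def connected_components(n, neighbors):
    """Label every entity reachable through above-threshold edges with the same id."""
    labels = {}
    for start in range(n):
        if start in labels:
            continue
        labels[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbors[node]:
                if other not in labels:
                    labels[other] = start
                    stack.append(other)
    return [labels[i] for i in range(n)]


# Toy data (hypothetical): a similarity chain 0-1-2-3 plus an isolated entity 4.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.7), (0, 4, 0.1)]
lengths = [100, 90, 80, 70, 60]
neighbors = build_neighbors(pairs, n=5, threshold=0.4)

incremental = greedy_incremental(lengths, neighbors)  # [0, 0, 2, 2, 4]
cover_set = greedy_cover_set(5, neighbors)            # [1, 1, 1, 3, 4]
components = connected_components(5, neighbors)       # [0, 0, 0, 0, 4]
```

On this chain the three strategies disagree: connected components merges the whole chain through transitive links, while the two greedy variants only group entities directly similar to a seed, so they split the chain at different points.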

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | DataFrame with the entities to cluster. | required |
| `field_name` | `str` | Name of the field with the entity information (e.g., `protein_sequence` or `structure_path`). | required |
| `threshold` | `float` | Similarity value above which entities are considered similar. | `0.4` |
| `sim_df` | `DataFrame` | DataFrame with the similarities (metric) between query and target; it is the output of the `calculate_similarity` function. | required |
| `verbose` | `int` | How much information is displayed. Options: `0` (errors), `1` (warnings), `2` (all). | `0` |
| `cluster_algorithm` | `str` | Clustering algorithm to use. Options: `CDHIT` or `greedy_incremental`, `greedy_cover_set` (or `butina`), `connected_components`, `bitbirch`, `umap`. | `'greedy_incremental'` |
| `filter_smaller` | `Optional[bool]` | Whether to filter smaller indices when constructing adjacency matrices in similarity-based algorithms. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `np.ndarray` | DataFrame with entities and the cluster they belong to. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | The requested clustering algorithm is not supported. |