Clustering
generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True, **kwargs)
Generates clusters from a DataFrame.
This function supports several clustering algorithms that operate on pairwise similarity data. Each algorithm has different scalability, behavior, and underlying assumptions. Below is a summary of the available algorithms:
Clustering algorithms:
- `CDHIT` or `greedy_incremental`:
Greedy incremental clustering similar to CD-HIT. Entities are
sorted by length, and each entity not yet assigned seeds a new
cluster; all items above the similarity threshold are assigned to
that cluster. Fast, deterministic, and suited to data where
ordering by sequence length is meaningful.
- `greedy_cover_set` or `butina`:
A greedy set-cover–style approach (similar to Butina clustering).
Selects items with the largest number of neighbors above the
threshold and forms clusters around them. Tends to produce compact,
high-similarity groups.
- `connected_components`:
Treats similarity relations above the threshold as graph edges and
computes connected components. All entities connected (directly or
transitively) via similarity ≥ threshold belong to the same
cluster. Very fast and stable for large sparse similarity graphs.
- `bitbirch`:
Clustering based on the BitBirch tree/hashing algorithm. Supports
two modes:
(1) fingerprint-based (e.g. SMILES → Morgan fingerprints), or
(2) similarity-matrix-derived.
Scales efficiently to large datasets and creates hierarchical,
radius-based clusters.
- `umap`:
Reduces high-dimensional fingerprints or similarity matrices into a
low-dimensional manifold using UMAP, then applies agglomerative
clustering. Useful when clusters are better separated in embedded
space than in raw feature or similarity space.
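To make the difference between the two simplest strategies concrete, the following is a minimal sketch, not the library's implementation. It assumes similarities arrive as hypothetical `(i, j, similarity)` triples, that entity lengths are known, and that "similar" means similarity ≥ threshold; the names `PAIRS`, `LENGTHS`, `connected_components`, and `greedy_incremental` are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy input: (i, j, similarity) triples over 5 entities.
PAIRS = [(0, 1, 0.9), (1, 2, 0.5), (3, 4, 0.8), (0, 2, 0.1)]
# Hypothetical entity lengths (e.g. sequence lengths), indexed by entity.
LENGTHS = [10, 8, 6, 7, 5]


def connected_components(n, pairs, threshold):
    """Label entities so that any two linked by a chain of
    similarities >= threshold share the same label."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, sim in pairs:
        if sim >= threshold:
            parent[find(i)] = find(j)  # merge the two components
    return [find(i) for i in range(n)]


def greedy_incremental(lengths, pairs, threshold):
    """Visit entities from longest to shortest; each unassigned entity
    seeds a cluster and absorbs its still-unassigned direct neighbours."""
    neighbours = defaultdict(set)
    for i, j, sim in pairs:
        if sim >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    labels = [None] * len(lengths)
    order = sorted(range(len(lengths)), key=lambda k: -lengths[k])
    for seed in order:
        if labels[seed] is not None:
            continue  # already absorbed by an earlier, longer seed
        labels[seed] = seed
        for other in neighbours[seed]:
            if labels[other] is None:
                labels[other] = seed
    return labels


if __name__ == "__main__":
    print(connected_components(len(LENGTHS), PAIRS, 0.4))
    print(greedy_incremental(LENGTHS, PAIRS, 0.4))
```

With a threshold of 0.4, the connected-components sketch places entities 0, 1, and 2 in one cluster because 2 is linked to 0 through 1. The greedy sketch leaves entity 2 in its own cluster, since it is not directly similar to the seed (entity 0).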
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame with entities to cluster. | required |
| `field_name` | `str` | Name of the field with the entity information (e.g., | required |
| `threshold` | `float` | Similarity value above which entities will be considered similar, defaults to 0.4 | `0.4` |
| `sim_df` | `DataFrame` | DataFrame with similarities ( | required |
| `verbose` | `int` | How much information will be displayed. Options: 0: Errors, 1: Warnings, 2: All. Defaults to 0 | `0` |
| `cluster_algorithm` | `str` | Clustering algorithm to use. Options: - | `'greedy_incremental'` |
| `filter_smaller` | `Optional[bool]` | Whether to filter smaller indices when constructing adjacency matrices in similarity-based algorithms, defaults to True. | `True` |
Returns:
| Type | Description |
|---|---|
| `np.ndarray` | DataFrame with entities and the cluster they belong to. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | Clustering algorithm is not supported. |