Clustering
generate_clusters(df, field_name, sim_df, threshold=0.4, verbose=0, cluster_algorithm='greedy_incremental', filter_smaller=True)
Generates clusters from a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | DataFrame with entities to cluster. | required |
field_name | str | Name of the field with the entity information (e.g., | required |
threshold | float | Similarity value above which entities will be considered similar. | 0.4 |
sim_df | DataFrame | DataFrame with similarities ( | required |
verbose | int | How much information will be displayed. Options: 0 (errors), 1 (warnings), 2 (all). | 0 |
cluster_algorithm | str | Clustering algorithm to use. Options: - | 'greedy_incremental' |
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with entities and the cluster they belong to. |
Raises:
Type | Description |
---|---|
NotImplementedError | Clustering algorithm is not supported. |
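
To make the role of `threshold` concrete, the following is a minimal, library-independent sketch of one common greedy incremental scheme. It is not the implementation behind `generate_clusters`: the function name `greedy_incremental_sketch`, the plain-tuple similarity format, and the choice of the first unassigned entity as cluster representative are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def greedy_incremental_sketch(
    entities: List[Hashable],
    similarities: Iterable[Tuple[Hashable, Hashable, float]],
    threshold: float = 0.4,
) -> Dict[Hashable, Hashable]:
    """Map each entity to a cluster representative (hypothetical helper).

    Assumption: `similarities` holds (entity_a, entity_b, similarity) tuples;
    pairs that are not listed are treated as dissimilar.
    """
    # Keep only pairs whose similarity is strictly above the threshold.
    neighbours: Dict[Hashable, Set[Hashable]] = {e: set() for e in entities}
    for a, b, sim in similarities:
        if sim > threshold and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    assignment: Dict[Hashable, Hashable] = {}
    for entity in entities:
        if entity in assignment:
            continue  # already absorbed by an earlier representative
        assignment[entity] = entity  # entity starts a new cluster
        for other in neighbours[entity]:
            # Unassigned neighbours join the new representative's cluster.
            assignment.setdefault(other, entity)
    return assignment


clusters = greedy_incremental_sketch(
    ["a", "b", "c", "d"],
    [("a", "b", 0.9), ("b", "c", 0.5), ("c", "d", 0.2)],
    threshold=0.4,
)
print(clusters)
```

In this sketch `b` joins the cluster started by `a`, while `c` starts its own cluster: it is similar to `b` but not to the representative `a`. Raising `threshold` produces more, smaller clusters.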