Similarity calculation
calculate_similarity(df_query, df_target=None, data_type='protein', similarity_metric='mmseqs+prefilter', field_name='sequence', threshold=0.3, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, distance='tanimoto', bits=1024, radius=2, denominator='shortest', representation='3di+aa', config=None)
Calculate similarity between entities in `df_query` and `df_target`. Entities can be biological sequences (nucleic acids or proteins), protein structures, or small molecules (in SMILES format).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_query | DataFrame | DataFrame with query entities to calculate similarities | required |
df_target | DataFrame | DataFrame with target entities to calculate similarities. If not specified, the | None |
data_type | str | Biochemical data_type to which the data belongs. Options: | 'protein' |
similarity_metric | str | Similarity function to use. Options: | 'mmseqs+prefilter' |
field_name | str | Name of the field with the entity information (e.g., | 'sequence' |
threshold | float | Similarity value above which entities will be considered similar, defaults to 0.3 | 0.3 |
threads | int | Number of threads available for parallelization, defaults to cpu_count() | cpu_count() |
verbose | int | How much information will be displayed. Options: 0: errors, 1: warnings, 2: all. Defaults to 0 | 0 |
save_alignment | bool | Save a file with the similarity calculations, defaults to False | False |
filename | str | Filename where to save the similarity calculations requires | None |
distance | str | Distance metric for small molecule comparison. Currently restricted to Tanimoto distance; more metrics will be added in future patches. If you are interested in a specific metric, please let us know. Options: | 'tanimoto' |
bits | int | Number of bits for ECFP, defaults to 1024 | 1024 |
radius | int | Radius for ECFP calculation, defaults to 2 | 2 |
denominator | str | Denominator for sequence alignments: which length is used as the denominator when calculating sequence identity. Options: | 'shortest' |
representation | str | Representation for protein structures as interpreted by | '3di+aa' |
config | dict | Dictionary with options for the EMBOSS needle module. Default values: "gapopen": 10, "gapextend": 0.5, "endweight": True, "endopen": 10, "endextend": 0.5, "matrix": "EBLOSUM62" | None |
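To make the `distance`, `bits`, and `threshold` parameters more concrete, the following is a minimal standard-library sketch of a Tanimoto comparison between two bit fingerprints. It is not the library's implementation: the helper name `tanimoto_similarity` and the on-bit indices are hypothetical, and real ECFP fingerprints would come from a cheminformatics toolkit rather than hand-written sets.

```python
def tanimoto_similarity(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    shared = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|, with the union computed by inclusion-exclusion
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices, each in range(1024) to mirror bits=1024.
fp_query = {3, 17, 256, 900}
fp_target = {3, 17, 300, 900, 1001}

similarity = tanimoto_similarity(fp_query, fp_target)
is_similar = similarity > 0.3  # plays the same role as the `threshold` parameter
print(f"{similarity:.3f}", is_similar)
```

Here three bits are shared out of six distinct on-bits, so the pair would count as similar under the default threshold of 0.3.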
Returns:
Type | Description |
---|---|
pd.DataFrame | DataFrame with similarities ( |
Raises:
Type | Description |
---|---|
NotImplementedError | Biochemical data_type is not supported see |
NotImplementedError | Similarity metric is not supported see |
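The effect of the `denominator` parameter of `calculate_similarity` can be illustrated with a small sketch that computes sequence identity from an already aligned pair. This is an assumption-laden illustration, not the library's code: the function `sequence_identity` is hypothetical, only `'shortest'` is confirmed by the signature above, and `'longest'` is an assumed option name included for contrast.

```python
def sequence_identity(aligned_a: str, aligned_b: str, denominator: str = "shortest") -> float:
    """Identical aligned positions divided by the chosen sequence length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    len_a = len(aligned_a.replace("-", ""))  # ungapped length of sequence A
    len_b = len(aligned_b.replace("-", ""))  # ungapped length of sequence B
    if denominator == "shortest":
        length = min(len_a, len_b)
    elif denominator == "longest":  # assumed option name, not confirmed by the source
        length = max(len_a, len_b)
    else:
        raise NotImplementedError(f"unsupported denominator: {denominator}")
    return matches / length if length else 0.0


# Hypothetical alignment: 5 identical columns, ungapped lengths 8 and 6.
id_shortest = sequence_identity("MKT-AYIAK", "MKTLAY---", "shortest")
id_longest = sequence_identity("MKT-AYIAK", "MKTLAY---", "longest")
print(round(id_shortest, 3), round(id_longest, 3))
```

Dividing by the shorter length gives the higher identity, which makes the choice of denominator matter when a fixed `threshold` decides whether two sequences count as similar.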
sim_df2mtx(sim_df, threshold=0.05)
Generates a similarity matrix from a DataFrame with the results from similarity calculations in the form of `query`, `target`, and `metric`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sim_df | DataFrame | DataFrame with similarity calculations with the columns | required |
threshold | float | Similarity threshold below which elements are considered dissimilar, defaults to 0.05 | 0.05 |
Returns:
Type | Description |
---|---|
spr.bsr_matrix | Sparse similarity matrix with shape n x n, where n is the number of unique elements in the |
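The conversion that `sim_df2mtx` describes, from long-format `query`, `target`, `metric` records to a thresholded sparse matrix, can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function `records_to_sparse` is hypothetical, entities are assumed to be identified by integer indices starting at 0, the matrix is mirrored on the assumption that similarity is symmetric, and a dictionary of keys stands in for the SciPy sparse matrix that the real function returns.

```python
def records_to_sparse(records, threshold=0.05):
    """Build a dict-of-keys sparse matrix from (query, target, metric) records."""
    # n is taken from the largest index seen (assumes indices 0..n-1).
    n = 1 + max((max(q, t) for q, t, _ in records), default=-1)
    entries = {}
    for q, t, metric in records:
        if metric < threshold:
            continue  # below the threshold: treated as dissimilar, nothing stored
        entries[(q, t)] = metric
        entries[(t, q)] = metric  # mirrored; symmetry is an assumption of this sketch
    return (n, n), entries


# Hypothetical similarity results in (query, target, metric) form.
records = [(0, 1, 0.82), (0, 2, 0.03), (1, 2, 0.40)]
shape, entries = records_to_sparse(records, threshold=0.05)
print(shape, sorted(entries.items()))
```

The pair (0, 2) falls below the threshold and leaves no entry, which is what keeps the matrix sparse when most pairs are dissimilar.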