Similarity calculation

calculate_similarity(df_query, df_target=None, data_type='protein', similarity_metric='mmseqs+prefilter', field_name='sequence', threshold=0.3, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, distance='tanimoto', bits=1024, radius=2, denominator='shortest', representation='3di+aa', config=None)

Calculate similarity between entities in df_query and df_target. Entities can be biological sequences (nucleic acids or proteins), protein structures or small molecules (in SMILES format).

Parameters:

Name Type Description Default
df_query DataFrame

DataFrame with the query entities for which similarities are calculated.

required
df_target DataFrame

DataFrame with the target entities to compare against. If not specified, df_query is also used as df_target. Defaults to None.

None
data_type str

Biochemical data_type to which the data belongs. Options: protein, DNA, RNA, or small_molecule; defaults to 'protein'

'protein'
similarity_metric str

Similarity function to use. Options:
- protein: mmseqs (local alignment), mmseqs+prefilter (fast local alignment), needle (global alignment), or foldseek (structural alignment).
- DNA or RNA: mmseqs (local alignment), mmseqs+prefilter (fast local alignment), or needle (global alignment).
- small molecule: scaffold (boolean comparison of Bemis-Murcko scaffolds: either identical or not) or fingerprint (Tanimoto distance between ECFPs, extended-connectivity fingerprints).
Defaults to 'mmseqs+prefilter'.

'mmseqs+prefilter'
field_name str

Name of the field with the entity information (e.g., protein_sequence or structure_path), defaults to 'sequence'.

'sequence'
threshold float

Similarity value above which entities will be considered similar, defaults to 0.3

0.3
threads int

Number of threads available for parallelization, defaults to cpu_count()

cpu_count()
verbose int

How much information will be displayed. Options:
- 0: Errors
- 1: Warnings
- 2: All
Defaults to 0.

0
save_alignment bool

Save file with similarity calculations, defaults to False

False
filename str

Name of the file in which to save the similarity calculations. Requires save_alignment to be set to True. Defaults to None.

None
distance str

Distance metric for small molecule comparison. Currently restricted to the Tanimoto distance; more metrics will be added in future patches. If you are interested in a specific metric, please let us know. Options:
- tanimoto: Calculates the Tanimoto distance
Defaults to 'tanimoto'.

'tanimoto'
bits int

Number of bits for ECFP, defaults to 1024

1024
radius int

Radius for ECFP calculation, defaults to 2

2
denominator str

Denominator for sequence alignments: the length used as the denominator when calculating sequence identity. Options:
- shortest: The shortest sequence of the pair
- longest: The longest sequence of the pair (recommended only for peptides)
- n_aligned: Full alignment length (recommended with global alignment)
Defaults to 'shortest'.

'shortest'
representation str

Representation for protein structures as interpreted by Foldseek. Options:
- 3di: 3D interactions vocabulary.
- 3di+aa: 3D interactions vocabulary and amino acid sequence.
- TM: global structural alignment (slow)
Defaults to '3di+aa'.

'3di+aa'
config dict

Dictionary with options for the EMBOSS needle module. Default values:
- "gapopen": 10
- "gapextend": 0.5
- "endweight": True
- "endopen": 10
- "endextend": 0.5
- "matrix": "EBLOSUM62"

None
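
The effect of the denominator option on the reported sequence identity can be illustrated with a small, self-contained sketch. It is not the library's implementation: the function name sequence_identity and the alignment numbers are hypothetical, and the sketch only assumes that identity is the number of identical positions divided by the selected length.

```python
def sequence_identity(n_identical, len_query, len_target, n_aligned,
                      denominator="shortest"):
    """Illustrative only: identical positions divided by the chosen length."""
    if denominator == "shortest":
        length = min(len_query, len_target)
    elif denominator == "longest":
        length = max(len_query, len_target)
    elif denominator == "n_aligned":
        length = n_aligned
    else:
        raise ValueError(f"Unknown denominator: {denominator!r}")
    return n_identical / length


# Hypothetical alignment: 40 identical positions between sequences of
# 50 and 200 residues, with an alignment spanning 80 columns.
identities = {
    option: sequence_identity(40, 50, 200, 80, option)
    for option in ("shortest", "longest", "n_aligned")
}

# With the default threshold of 0.3, the same pair is considered similar
# or not depending on the denominator.
similar = {option: value > 0.3 for option, value in identities.items()}
print(identities)
print(similar)
```

In this toy case the pair passes the default threshold with shortest and n_aligned but not with longest, so the choice of denominator can change which pairs are reported as similar.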

Returns:

Type Description
pd.DataFrame

DataFrame with the similarities (metric) between query and target. query and target are identified by the indexes obtained from the pd.unique function on the corresponding input DataFrames.

Raises:

Type Description
NotImplementedError

Biochemical data_type is not supported; see data_type.

NotImplementedError

Similarity metric is not supported; see similarity_metric.
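
For the fingerprint metric with distance='tanimoto', the underlying comparison can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: real ECFPs are computed by a cheminformatics toolkit from SMILES using the bits and radius parameters, whereas here each fingerprint is a hypothetical set of on-bit positions. The sketch computes the Tanimoto coefficient as a similarity (shared on-bits divided by total on-bits), which is consistent with how threshold is described.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Hypothetical on-bit positions in a 1024-bit fingerprint.
fp_a = {3, 17, 256, 901}
fp_b = {3, 17, 512, 901, 1000}

similarity = tanimoto_similarity(fp_a, fp_b)  # 3 shared bits / 6 bits in total
print(similarity)
```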

sim_df2mtx(sim_df, threshold=0.05)

Generates a similarity matrix from a DataFrame with the results from similarity calculations in the form of query, target, and metric.

Parameters:

Name Type Description Default
sim_df DataFrame

DataFrame with similarity calculations with the columns query, target, and metric.

required
threshold float

Similarity threshold below which elements are considered dissimilar, defaults to 0.05

0.05

Returns:

Type Description
spr.bsr_matrix

Sparse similarity matrix with shape n × n, where n is the number of unique elements in the query column.
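
The conversion from query, target, and metric rows to a thresholded n × n matrix can be sketched without any dependencies. This is a simplified illustration, not the actual sim_df2mtx code: the helper name pairs_to_sparse is hypothetical, the sparse matrix is represented as a dictionary keyed by (query, target), and keeping pairs whose metric is at or above the threshold is an assumption about how the boundary is handled.

```python
def pairs_to_sparse(rows, threshold=0.05):
    """Return (n, entries) for an n x n sparse matrix.

    rows is an iterable of (query, target, metric) tuples, and entries maps
    (query, target) to metric for the pairs that pass the threshold.
    """
    rows = list(rows)
    n = len({query for query, _, _ in rows})  # unique elements in query
    entries = {
        (query, target): metric
        for query, target, metric in rows
        if metric >= threshold  # assumption: the boundary value is kept
    }
    return n, entries


# Hypothetical similarity results for three entities indexed 0, 1, and 2.
rows = [
    (0, 0, 1.0), (0, 1, 0.42), (0, 2, 0.03),
    (1, 0, 0.42), (1, 1, 1.0), (1, 2, 0.01),
    (2, 0, 0.03), (2, 1, 0.01), (2, 2, 1.0),
]
n, entries = pairs_to_sparse(rows, threshold=0.05)

# Dense view for inspection; pairs that were filtered out appear as 0.0.
dense = [[entries.get((i, j), 0.0) for j in range(n)] for i in range(n)]
for line in dense:
    print(line)
```

Only the pairs that pass the threshold are stored, which is what makes a sparse representation worthwhile when most entities are dissimilar.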