Similarity calculation

`calculate_similarity(df_query, df_target=None, data_type='protein', similarity_metric='mmseqs+prefilter', field_name='sequence', threshold=0.3, threads=cpu_count(), verbose=0, save_alignment=False, filename=None, distance='tanimoto', bits=1024, radius=2, denominator='shortest', representation='3di+aa', config=None)`

Calculate similarity between entities in df_query and df_target. Entities can be biological sequences (nucleic acids or proteins), protein structures or small molecules (in SMILES format).

Parameters:

Name	Type	Description	Default
`df_query`	`DataFrame`	DataFrame with query entities to calculate similarities	required
`df_target`	`DataFrame`	DataFrame with target entities to calculate similarities. If not specified, `df_query` is also used as `df_target`. Defaults to None.	`None`
`data_type`	`str`	Biochemical data type of the entities. Options: `protein`, `DNA`, `RNA`, or `small_molecule`. Defaults to 'protein'.	`'protein'`
`similarity_metric`	`str`	Similarity function to use. Options: - `protein`: `mmseqs` (local alignment), `mmseqs+prefilter` (fast local alignment), `needle` (global alignment), or `foldseek` (structural alignment). - `DNA` or `RNA`: `mmseqs` (local alignment), `mmseqs+prefilter` (fast local alignment), or `needle` (global alignment). - `small_molecule`: `scaffold` (boolean comparison of Bemis-Murcko scaffolds: either identical or not) or `fingerprint` (Tanimoto distance between extended-connectivity fingerprints, ECFP). Defaults to `mmseqs+prefilter`.	`'mmseqs+prefilter'`
`field_name`	`str`	Name of the field with the entity information (e.g., `protein_sequence` or `structure_path`), defaults to 'sequence'.	`'sequence'`
`threshold`	`float`	Similarity value above which entities will be considered similar, defaults to 0.3	`0.3`
`threads`	`int`	Number of threads available for parallelization, defaults to cpu_count()	`cpu_count()`
`verbose`	`int`	How much information will be displayed. Options: - 0: Errors - 1: Warnings - 2: All. Defaults to 0.	`0`
`save_alignment`	`bool`	Save file with similarity calculations, defaults to False	`False`
`filename`	`str`	Filename where the similarity calculations are saved. Requires `save_alignment` set to `True`. Defaults to None.	`None`
`distance`	`str`	Distance metric for small molecule comparison. Currently restricted to the Tanimoto distance; more metrics will be added in future patches. If you are interested in a specific metric, please let us know. Options: - `tanimoto`: Calculates the Tanimoto distance. Defaults to 'tanimoto'.	`'tanimoto'`
`bits`	`int`	Number of bits for ECFP, defaults to 1024	`1024`
`radius`	`int`	Radius for ECFP calculation, defaults to 2	`2`
`denominator`	`str`	Denominator for sequence alignments: the length used as the denominator when calculating sequence identity. Options: - `shortest`: The shortest sequence of the pair - `longest`: The longest sequence of the pair (recommended only for peptides) - `n_aligned`: Full alignment length (recommended with global alignment). Defaults to 'shortest'.	`'shortest'`
`representation`	`str`	Representation for protein structures as interpreted by `Foldseek`. Options: - `3di`: 3D interactions vocabulary. - `3di+aa`: 3D interactions vocabulary and amino acid sequence. - `TM`: global structural alignment (slow). Defaults to '3di+aa'.	`'3di+aa'`
`config`	`dict`	Dictionary with options for the EMBOSS needle module. Default values: - "gapopen": 10, - "gapextend": 0.5, - "endweight": True, - "endopen": 10, - "endextend": 0.5, - "matrix": "EBLOSUM62"	`None`
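
To make the effect of the `denominator` option concrete, the sketch below computes sequence identity for one gapped pairwise alignment under each of the three choices. It is an illustration only and not the library's code: the helper name `sequence_identity`, the use of `-` as the gap character, and the rule that gap columns never count as matches are all assumptions.

```python
def sequence_identity(aligned_a: str, aligned_b: str,
                      denominator: str = "shortest") -> float:
    """Identity of a gapped pairwise alignment (hypothetical helper)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both rows carry the same residue; gaps never match.
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))

    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))

    if denominator == "shortest":
        length = min(len_a, len_b)
    elif denominator == "longest":
        length = max(len_a, len_b)
    elif denominator == "n_aligned":
        length = len(aligned_a)  # full alignment length, gap columns included
    else:
        raise ValueError(f"unknown denominator: {denominator!r}")

    return matches / length if length else 0.0


# Six identical columns; ungapped lengths 8 and 7; alignment length 9.
row_a = "ACDEFG-IK"
row_b = "ACDE--HIK"
identities = {
    name: sequence_identity(row_a, row_b, name)
    for name in ("shortest", "longest", "n_aligned")
}
```

For the same alignment, `shortest` gives the highest value (6/7), `longest` gives 6/8, and `n_aligned` gives the lowest (6/9), so the choice of denominator interacts directly with `threshold`.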

Returns:

Type	Description
`pd.DataFrame`	DataFrame with similarities (`metric`) between `query` and `target`. `query` and `target` are identified by the indexes obtained by applying the `pd.unique` function to the corresponding input DataFrames.

Raises:

Type	Description
`NotImplementedError`	Biochemical data type is not supported; see `data_type`.
`NotImplementedError`	Similarity metric is not supported; see `similarity_metric`.
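
For the `fingerprint` option, the comparison between two molecules reduces to a Tanimoto calculation over fingerprint bits. The sketch below shows the Tanimoto coefficient (intersection over union of the on-bits) and how a `threshold` of 0.3 would classify a pair. It assumes each fingerprint is already available as a set of on-bit indices; computing ECFP from SMILES is out of scope here, and the names `tanimoto_similarity`, `fp_a`, and `fp_b` are hypothetical.

```python
from typing import Set


def tanimoto_similarity(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as dissimilar here
    return len(bits_a & bits_b) / union


# Toy fingerprints: indices of the bits set to 1 (not real ECFP output).
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}

similarity = tanimoto_similarity(fp_a, fp_b)  # 3 shared bits / 6 bits in union
is_similar = similarity > 0.3                 # same rule as `threshold=0.3`
```

The corresponding Tanimoto distance is one minus this coefficient.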

`sim_df2mtx(sim_df, threshold=0.05)`

Generates a similarity matrix from a DataFrame with the results from similarity calculations in the form of query, target, and metric.

Parameters:

Name	Type	Description	Default
`sim_df`	`DataFrame`	DataFrame with similarity calculations with the columns `query`, `target`, and `metric`.	required
`threshold`	`float`	Similarity threshold below which elements are considered dissimilar, defaults to 0.05	`0.05`

Returns:

Type	Description
`spr.bsr_matrix`	Sparse similarity matrix with shape n x n, where n is the number of unique elements in the `query` column.
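
The conversion can be pictured with a small plain-Python sketch that avoids pandas and SciPy. It is not the library's implementation: the function name `triples_to_sparse`, the dictionary-of-keys storage, and the decision to store pairs exactly as given (without symmetrizing) are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[int, int, float]  # (query index, target index, metric)


def triples_to_sparse(triples: Iterable[Triple],
                      threshold: float = 0.05
                      ) -> Tuple[int, Dict[Tuple[int, int], float]]:
    """Return (n, entries) for an n x n sparse similarity matrix."""
    triples = list(triples)
    # n is the number of unique query indexes, as in the description above.
    n = len({query for query, _, _ in triples})

    entries: Dict[Tuple[int, int], float] = {}
    for query, target, metric in triples:
        if metric < threshold:
            continue  # below threshold: considered dissimilar, not stored
        entries[(query, target)] = metric
    return n, entries


# Toy similarity results in (query, target, metric) form.
results = [
    (0, 0, 1.0), (0, 1, 0.42), (1, 0, 0.42),
    (1, 1, 1.0), (2, 2, 1.0), (0, 2, 0.01), (2, 0, 0.01),
]
n, entries = triples_to_sparse(results, threshold=0.05)
```

Only the pairs at or above the threshold occupy storage, which is what keeps the matrix sparse when most entities are dissimilar.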
