AutoPeptideML

Main class for handling the automatic development of bioactive peptide ML predictors.

Initialize instance of the AutoPeptideML class

Parameters:

Name Type Description Default
verbose bool

Whether to output information, defaults to True

True
threads int

Number of threads used to parallelise processes, defaults to cpu_count()

cpu_count()
seed int

Pseudo-random number generator seed. Important for reproducibility, defaults to 42

42

autosearch_negatives(df_pos, positive_tags, proportion=1.0, target_db=None)

Search bioactive peptide databases for negative peptides to add to the positive dataset.

Parameters:

Name Type Description Default
df_pos DataFrame

DataFrame with positive peptides.

required
positive_tags List[str]

List of names of bioactivities that may overlap with the target bioactivities.

required
proportion float

Negative:Positive ratio in the new dataset, defaults to 1.0.

1.0
target_db Optional[str]

Path to CSV containing a database with columns sequence and bioactivities.

None

Returns:

Type Description
pd.DataFrame

New dataset with both positive and negative peptides.
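The selection logic described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the function name `sample_negatives`, the `(sequence, tags)` tuple layout, and the toy database are assumptions made for illustration only.

```python
import random


def sample_negatives(positives, database, positive_tags, proportion=1.0, seed=42):
    """Hypothetical helper: pick negatives whose tags do not overlap positive_tags."""
    excluded = set(positive_tags)
    known_positive = set(positives)
    # Keep database entries with no overlapping bioactivity that are not already positives.
    candidates = [
        seq for seq, tags in database
        if not excluded.intersection(tags) and seq not in known_positive
    ]
    # proportion is the Negative:Positive ratio of the resulting dataset.
    n_negatives = min(len(candidates), round(len(positives) * proportion))
    rng = random.Random(seed)
    negatives = rng.sample(candidates, n_negatives)
    # Label positives 1 and negatives 0, mirroring a Y column.
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]


positives = ["KLAKLAK", "GIGAVLK"]
database = [
    ("AAAAGGG", {"antiviral"}),
    ("KKLLKKL", {"antimicrobial"}),  # overlapping tag, so it is excluded
    ("DDEEDDE", {"neuropeptide"}),
    ("PPGPPGP", {"toxin"}),
]
dataset = sample_negatives(positives, database, ["antimicrobial"], proportion=1.0)
```

With `proportion=1.0` the sketch returns as many negatives as positives, provided enough candidates remain after filtering.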

balance_samples(df)

Oversample the underrepresented class in the DataFrame.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with positive and negative peptides to be balanced.

required

Returns:

Type Description
pd.DataFrame

DataFrame with balanced number of positive and negative peptides.
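As a rough illustration of oversampling, the sketch below duplicates randomly chosen rows of the smaller class until both classes have the same size. The helper name `oversample` and the `(sequence, label)` row layout are assumptions, and the library may choose the duplicated rows differently.

```python
import random


def oversample(rows, seed=42):
    """Hypothetical helper: balance (sequence, label) rows by random oversampling."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to sample from; return the input unchanged.
        return list(rows)
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


rows = [("KLAKLAK", 1), ("AAAAGGG", 0), ("DDEEDDE", 0), ("PPGPPGP", 0)]
balanced = oversample(rows)
```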

compute_representations(datasets, re)

Use a Protein Representation Model, loaded with the RepresentationEngine class, to compute representations for the peptides in the datasets.

Parameters:

Name Type Description Default
datasets Dict[str, DataFrame]

dictionary with the dataset partitions as DataFrames. Output from the method train_test_partition.

required
re RepresentationEngine

class with a Protein Representation Model.

required

Returns:

Type Description
dict

Dictionary with pd.DataFrame id column as keys and the representation of the sequence column as values.

curate_dataset(dataset, outputdir=None)

Load a DataFrame or use one already loaded and then remove all entries with non-canonical residues or repeated sequences.

Parameters:

Name Type Description Default
dataset Union[str, DataFrame]

Dataset or path to dataset.

required
outputdir str

Path where the curated dataset is saved, defaults to None

None

Returns:

Type Description
pd.DataFrame

Curated dataset.
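The two curation rules (drop non-canonical residues, drop repeated sequences) can be sketched on a plain list of sequences. This is an illustrative stand-in, not the library code; the `curate` name is hypothetical, and treating the 20 standard amino-acid letters as the canonical set is an assumption.

```python
# Assumption: "canonical" means the 20 standard amino-acid one-letter codes.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def curate(sequences):
    """Hypothetical helper: keep the first occurrence of each canonical-only sequence."""
    seen = set()
    curated = []
    for seq in sequences:
        if not seq or not set(seq) <= CANONICAL:
            continue  # empty, or contains a non-canonical residue
        if seq in seen:
            continue  # repeated sequence
        seen.add(seq)
        curated.append(seq)
    return curated


curated = curate(["ACDK", "ACDK", "ACXK", "GGBZ", "WYV"])
```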

evaluate_model(best_model, test_df, id2rep, outputdir)

Evaluate an ensemble model.

Parameters:

Name Type Description Default
best_model list

List of models with a predict_proba method.

required
test_df DataFrame

Evaluation dataset with id, sequence and Y columns.

required
id2rep dict

Dictionary with keys being the id and the values the peptide representations.

required
outputdir str

Path where the evaluation data is saved.

required

Returns:

Type Description
pd.DataFrame

Dataset with the evaluation metrics.
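One plausible way to evaluate a list of models exposing predict_proba is to average their positive-class probabilities and score the thresholded result. The sketch below assumes exactly that; the metrics shown (accuracy and Matthews correlation coefficient), the 0.5 threshold, and the `ToyModel` class are illustrative choices, since this page does not list which metrics the method reports.

```python
import math


class ToyModel:
    """Hypothetical stand-in for a fitted classifier with a predict_proba method."""

    def __init__(self, shift):
        self.shift = shift

    def predict_proba(self, X):
        out = []
        for x in X:
            p1 = min(1.0, max(0.0, x[0] + self.shift))
            out.append([1.0 - p1, p1])  # [P(class 0), P(class 1)]
        return out


def evaluate_ensemble(models, X, y_true, threshold=0.5):
    # Positive-class probability from every model, then the per-sample average.
    per_model = [[row[1] for row in m.predict_proba(X)] for m in models]
    avg = [sum(col) / len(col) for col in zip(*per_model)]
    y_pred = [int(p >= threshold) for p in avg]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "mcc": mcc}


X = [[0.9], [0.8], [0.2], [0.1]]
y_true = [1, 1, 0, 0]
metrics = evaluate_ensemble([ToyModel(-0.1), ToyModel(0.1)], X, y_true)
```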

hpo_train(config, train_df, id2rep, folds, outputdir, n_jobs=1)

Hyperparameter Optimisation and training.

Parameters:

Name Type Description Default
config dict

dictionary with hyperparameter search space.

required
train_df DataFrame

Training dataset with id column and Y column with the bioactivity target.

required
id2rep dict

Dictionary with pd.DataFrame id column as keys and the representation of the sequence column as values.

required
folds list

List with the training/validation folds

required
outputdir str

Path to the directory where information should be saved.

required
n_jobs int

Number of threads to parallelise the training, defaults to 1.

1

Returns:

Type Description
list

List with the models that comprise the final ensemble.

predict(df, re, ensemble_path, outputdir, df_repr=None, backend='onnx')

Predicts scores and uncertainties for input sequences using a pre-trained ensemble of models.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing sequences to be predicted. Must include a 'sequence' column.

required
re RepresentationEngine

Representation engine for computing feature representations of sequences.

required
ensemble_path str

Path to the directory containing the ensemble of trained models.

required
outputdir str

Directory where prediction results will be saved. Created if it does not exist.

required
df_repr list

Precomputed representations of the input sequences. If None, representations are computed.

None
backend str

Backend used for prediction. Supported values are 'onnx' (default) and 'joblib'.

'onnx'

Returns:

Type Description
pd.DataFrame

DataFrame with predictions, including 'score' (average prediction) and 'score_uncertainty' (standard deviation).

Raises:

Type Description
ImportError

If required libraries for the selected backend are not installed.

NotImplementedError

If an unsupported backend is specified.

Notes:

- Converts joblib models to ONNX format if backend='onnx' and joblib models are provided.
- Saves predictions to 'predictions.csv' in the specified outputdir.

Example:

>>> import pandas as pd
>>> from mymodule import RepresentationEngine, Predictor
>>> predictor = Predictor(verbose=True)
>>> df = pd.DataFrame({'sequence': ['ATCG', 'GCTA']})
>>> re = RepresentationEngine()
>>> predictions = predictor.predict(
...     df, re, ensemble_path='./ensemble', outputdir='./output'
... )
>>> print(predictions)
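The 'score' and 'score_uncertainty' columns described for predict are the average and standard deviation of the ensemble members' outputs. A minimal sketch of that aggregation follows; the `aggregate` helper and the toy scores are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def aggregate(per_model_scores):
    """Hypothetical helper: combine per-model scores into mean and spread per sequence."""
    # per_model_scores[m][i] is the score of model m for sequence i.
    per_sequence = zip(*per_model_scores)
    return [
        {
            "score": statistics.fmean(scores),
            # Assumption: population standard deviation across ensemble members.
            "score_uncertainty": statistics.pstdev(scores),
        }
        for scores in per_sequence
    ]


per_model_scores = [
    [0.9, 0.2],  # model 1
    [0.7, 0.4],  # model 2
    [0.8, 0.3],  # model 3
]
predictions = aggregate(per_model_scores)
```

A larger 'score_uncertainty' means the ensemble members disagree more about that sequence.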

train_test_partition(df, threshold=0.3, test_size=0.2, denominator='n_aligned', alignment=None, outputdir='./splits')

Novel homology partitioning algorithm for generating independent hold-out evaluation sets.

This method partitions the provided dataset into training and testing sets based on sequence similarity. It ensures that sequences in the training and testing sets do not exceed a specified sequence identity threshold, resulting in distinct datasets for evaluation.

Example:

data = pd.DataFrame({'id': [...], 'sequence': [...], 'Y': [...]})
partitioned_data = train_test_partition(
    df=data,
    threshold=0.4,
    test_size=0.25,
    denominator='shortest',
    alignment='needle',
    outputdir='./data_splits'
)

Parameters:

Name Type Description Default
df DataFrame

Dataset to partition with the following columns: id, sequence, and Y.

required
threshold float

Maximum sequence identity allowed between sequences in training and evaluation sets. Sequences exceeding this threshold in similarity will not appear in both sets. Defaults to 0.3.

0.3
test_size float

Proportion of samples in evaluation (test) set. A float between 0 and 1, where 0.2 means 20% of the dataset will be allocated to the test set. Defaults to 0.2.

0.2
denominator str

Denominator used to calculate sequence identity between pairs of sequences. Options include:

- 'shortest': The shortest sequence length.
- 'longest': The longest sequence length.
- 'n_aligned': The length of the aligned region between sequences.

'n_aligned'
alignment str

Sequence alignment method to compute similarity. Options include:

- 'peptides': Peptide sequence alignment.
- 'mmseqs': Local Smith-Waterman alignment.
- 'mmseqs+prefilter': Fast alignment using Smith-Waterman with k-mer prefiltering.
- 'needle': Global Needleman-Wunsch alignment.

None
outputdir str

Directory where the resulting train and test CSV files will be saved. Defaults to './splits'.

'./splits'

Returns:

Type Description
Dict[str, pd.DataFrame]

A dictionary containing the training and testing DataFrames:

- 'train': The DataFrame for the training set.
- 'test': The DataFrame for the testing set.

Raises:

Type Description
FileNotFoundError

If the output directory cannot be created or accessed.

ValueError

If an unsupported alignment method is specified.
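To make the effect of the denominator option concrete, the sketch below computes sequence identity for a pair that has already been aligned (gaps written as '-'). It is illustrative only: the `sequence_identity` helper is hypothetical, and taking 'n_aligned' to be the number of alignment columns is an assumption about how the aligned region is measured.

```python
def sequence_identity(aligned_a, aligned_b, denominator="n_aligned"):
    """Hypothetical helper: identity of two pre-aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    if denominator == "shortest":
        denom = min(len_a, len_b)
    elif denominator == "longest":
        denom = max(len_a, len_b)
    elif denominator == "n_aligned":
        denom = len(aligned_a)  # assumption: all alignment columns
    else:
        raise ValueError(f"unsupported denominator: {denominator}")
    return matches / denom


a, b = "ACDEFG-K", "ACDQFGHK"
id_shortest = sequence_identity(a, b, "shortest")
id_longest = sequence_identity(a, b, "longest")
id_aligned = sequence_identity(a, b, "n_aligned")
```

Here the pair has 6 matching columns, so 'shortest' gives 6/7 while 'longest' and 'n_aligned' give 6/8. Under any of these definitions the pair exceeds the default threshold of 0.3, so the two sequences would not be placed on opposite sides of the train/test split.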

train_val_partition(df, method='random', threshold=0.5, alignment='peptides', denominator='n_aligned', n_folds=10, outputdir='./folds')

Method for generating n training/validation folds for cross-validation.

Parameters:

Name Type Description Default
df DataFrame

Training dataset with id, sequence, and Y columns.

required
method str

Method for generating the folds. Options available: random through sklearn.model_selection.StratifiedKFold or graph-part through graphpart.stratified_k_fold, defaults to random.

'random'
threshold float

If method is graph-part, maximum sequence identity allowed between sequences in training and validation sets, defaults to 0.5

0.5
denominator str

Denominator to calculate sequence identity. Options:

- shortest: Shortest sequence length
- longest: Longest sequence length
- n_aligned: Length of the alignment

'n_aligned'
alignment str

If method is graph-part, alignment algorithm to use. Options available: mmseqs (local Smith-Waterman alignment), mmseqs+prefilter (fast local Smith-Waterman alignment with k-mer prefiltering), and needle (global Needleman-Wunsch alignment), defaults to 'peptides'

'peptides'
n_folds int

Number of training/validation folds to generate, defaults to 10

10
outputdir str

Path where data should be saved, defaults to './folds'

'./folds'

Returns:

Type Description
list

List of training/validation folds
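For the random method, the page points to sklearn.model_selection.StratifiedKFold. The sketch below is a dependency-free approximation of stratified fold generation, not that class and not the library's code: the `stratified_folds` helper and the `{'train': ..., 'val': ...}` fold layout are assumptions, and the actual structure of the returned folds may differ.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=42):
    """Hypothetical helper: split indices into folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    val_sets = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            val_sets[pos % n_folds].append(idx)
    all_idx = set(range(len(labels)))
    # Each fold validates on its own subset and trains on everything else.
    return [
        {"train": sorted(all_idx - set(val)), "val": sorted(val)}
        for val in val_sets
    ]


labels = [1] * 10 + [0] * 10
folds = stratified_folds(labels, n_folds=5)
```

Every sample appears in exactly one validation subset, and each validation subset keeps roughly the same positive:negative ratio as the full training set.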