RepresentationEngine

Bases: Module

A class for generating sequence representations using pre-trained models, with flexible pooling and device management.

Attributes:

  • device : str The device ('cuda', 'mps', or 'cpu') on which the model will be run.
  • batch_size : int The size of data batches for processing.
  • model : torch.nn.Module The loaded pre-trained model for generating representations.
  • tokenizer : AutoTokenizer or T5Tokenizer The tokenizer associated with the pre-trained model.
  • lab : str Identifier for the lab or group associated with the model, e.g., 'Rostlab', 'facebook', or 'ElnaggarLab'.
  • dimension : int Dimension of the output representations.
  • model_name : str Name of the loaded pre-trained model.
  • head : Optional[torch.nn.Module] Optional head layer added to the model for specific tasks.

Parameters:

  • model : str Name of the pre-trained model to load.
  • batch_size : int Batch size for sequence processing.

Methods:

  • move_to_device(device: str) Sets the device for computation, either 'cpu', 'mps', or 'cuda'.

  • add_head(head: torch.nn.Module) Adds an optional head module to the model, which can be used for task-specific outputs.

  • compute_representations(sequences: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor] Generates representations for a list of input sequences. Supports average pooling or CLS token extraction.

  • compute_batch(batch: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor] Processes a batch of sequences and extracts representations according to specified pooling methods.

  • dim() -> int Returns the dimension of the representation layer of the model.

  • forward(batch, labels=None) Performs a forward pass through the model, with an optional head layer if added.

  • max_len() -> int Returns the maximum sequence length allowed for the model, determined by the lab attribute.

  • print_trainable_parameters() Prints the total and trainable parameter counts, as well as the percentage of trainable parameters.

  • _load_model(model: str) Internal method to load a pre-trained model and tokenizer based on the specified model name.

  • _divide_in_batches(sequences: List[str], batch_size: int) -> List[List[str]] Divides a list of sequences into batches and processes each sequence for compatibility with the model.
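The two pooling options named for compute_representations and compute_batch can be illustrated with a minimal sketch. It is not the library's implementation: plain Python lists stand in for tensors, the helper names (mean_pool, cls_pool, pool) are hypothetical, and it assumes that padding tokens are excluded from the average and that the CLS token sits at position 0. What the class returns when neither flag is set is not stated here; the sketch simply returns the per-token vectors unchanged.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector], attention_mask: Sequence[int]) -> Vector:
    """Average the vectors of the tokens whose mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # zip(*kept) walks the kept vectors one dimension at a time.
    return [sum(column) / len(kept) for column in zip(*kept)]


def cls_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Return the first token's vector (assumption: CLS is at position 0)."""
    return list(token_vectors[0])


def pool(
    token_vectors: Sequence[Vector],
    attention_mask: Sequence[int],
    average_pooling: bool,
    cls_token: bool = False,
):
    """Pick one pooling strategy; the two flags are mutually exclusive."""
    if average_pooling and cls_token:
        raise ValueError("choose average_pooling or cls_token, not both")
    if average_pooling:
        return mean_pool(token_vectors, attention_mask)
    if cls_token:
        return cls_pool(token_vectors)
    # Assumption: with neither flag set, keep the per-token vectors as they are.
    return [list(vec) for vec in token_vectors]
```

Either strategy yields one fixed-size vector per sequence, which is why the documented return type is a list with one tensor per input sequence.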

Method Details

  • move_to_device(device: str): Sets the device ('cpu', 'mps', or 'cuda') on which the model will run. Updates the self.device attribute.

  • add_head(head: torch.nn.Module): Adds a task-specific head module to the model, stored in self.head. This head will be applied in the forward() method for additional processing.

  • compute_representations(sequences: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor]: Generates vector representations for a list of input sequences. Either average_pooling or cls_token can be used to extract representations, but not both simultaneously. Divides the sequences into batches and processes each batch.

  • compute_batch(batch: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor]: Processes a single batch of sequences, creating representations using either average pooling or CLS token extraction. Returns a list of tensor representations for each sequence in the batch.

  • dim() -> int: Returns the dimension of the representation layer based on the loaded model.

  • forward(batch, labels=None): Executes a forward pass through the model with the given batch, applying the optional head if present. Returns the output representation tensor.

  • max_len() -> int: Returns the maximum sequence length allowed by the model based on the lab attribute.

  • print_trainable_parameters(): Calculates and prints the total number of model parameters, the number of trainable parameters, and the percentage of trainable parameters.

  • _load_model(model: str): Internal method that loads the specified pre-trained model and its tokenizer from Rostlab, Facebook, or ElnaggarLab. Sets attributes such as model, tokenizer, and dimension.

  • _divide_in_batches(sequences: List[str], batch_size: int) -> List[List[str]]: Splits a list of sequences into smaller batches according to the batch size. Adjusts sequences to meet model-specific formatting requirements.
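The batch flow described above (split the input, process each batch, collect one result per sequence) can be sketched as follows. This is an illustration under stated assumptions, not the class's code: divide_in_batches and compute_in_batches are hypothetical stand-alone functions, the compute_batch argument is any callable that returns one result per sequence, and the model-specific sequence formatting is left out because this page does not specify it.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def divide_in_batches(sequences: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split sequences into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(sequences[start:start + batch_size])
        for start in range(0, len(sequences), batch_size)
    ]


def compute_in_batches(
    sequences: Sequence[str],
    batch_size: int,
    compute_batch: Callable[[List[str]], List[T]],
) -> List[T]:
    """Run compute_batch on every batch and flatten the results in input order."""
    results: List[T] = []
    for batch in divide_in_batches(sequences, batch_size):
        results.extend(compute_batch(batch))
    return results
```

The last batch may be smaller than batch_size, and the flattened output keeps the order of the input sequences.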

This class is intended for users who need a flexible way to compute and handle representations from pre-trained models, supporting GPU (CUDA or MPS) or CPU computation.
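A caller typically has to decide which of the three device names to pass to move_to_device. The helper below is a hypothetical caller-side sketch, not part of the class: it takes the availability flags as plain booleans so that it runs without PyTorch. In a real script those flags would normally come from torch.cuda.is_available() and torch.backends.mps.is_available(). The preference order (CUDA, then MPS, then CPU) is an assumption.

```python
SUPPORTED_DEVICES = ("cuda", "mps", "cpu")


def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, and fall back to CPU (assumed preference order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def validate_device(device: str) -> str:
    """Reject names other than the three devices this page documents."""
    if device not in SUPPORTED_DEVICES:
        raise ValueError(f"unsupported device: {device!r}")
    return device
```

The returned string can then be handed to move_to_device.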