RepresentationEngine
Bases: Module
A class for generating sequence representations using pre-trained models, with flexible pooling and device management.
Attributes:
- device : str The device ('cuda', 'mps', or 'cpu') on which the model will be run.
- batch_size : int The size of data batches for processing.
- model : torch.nn.Module The loaded pre-trained model for generating representations.
- tokenizer : AutoTokenizer or T5Tokenizer The tokenizer associated with the pre-trained model.
- lab : str Identifier for the lab or group associated with the model, e.g., 'Rostlab', 'facebook', or 'ElnaggarLab'.
- dimension : int Dimension of the output representations.
- model_name : str Name of the loaded pre-trained model.
- head : Optional[torch.nn.Module] Optional head layer added to the model for specific tasks.
Parameters:
- model : str Name of the pre-trained model to load.
- batch_size : int Batch size for sequence processing.
Methods:
- move_to_device(device: str) Sets the device for computation, either 'cpu', 'mps', or 'cuda'.
- add_head(head: torch.nn.Module) Adds an optional head module to the model, which can be used for task-specific outputs.
- compute_representations(sequences: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor] Generates representations for a list of input sequences. Supports average pooling or CLS token extraction.
- compute_batch(batch: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor] Processes a batch of sequences and extracts representations according to specified pooling methods.
- dim() -> int Returns the dimension of the representation layer of the model.
- forward(batch, labels=None) Performs a forward pass through the model, with an optional head layer if added.
- max_len() -> int Returns the maximum sequence length allowed for the model, based on lab specifications.
- print_trainable_parameters() Prints the total and trainable parameter counts, as well as the percentage of trainable parameters.
- _load_model(model: str) Internal method to load a pre-trained model and tokenizer based on the specified model name.
- _divide_in_batches(sequences: List[str], batch_size: int) -> List[List[str]] Divides a list of sequences into batches and processes each sequence for compatibility with the model.
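The two pooling modes named for compute_representations can be illustrated with a minimal sketch. It is not the class's actual implementation: plain Python lists stand in for torch tensors, and the names mean_pool, cls_pool, and pool are hypothetical. It also assumes that the attention mask marks real tokens with 1 and padding with 0, that the CLS vector sits at position 0 (as with BERT-style tokenizers), and that passing neither flag returns the unpooled per-token vectors.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Column-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Return the vector at position 0 (assumed to be the CLS token)."""
    return list(token_vectors[0])


def pool(token_vectors: List[Vector], attention_mask: List[int],
         average_pooling: bool, cls_token: bool = False):
    """Dispatch on the two flags; they are treated as mutually exclusive."""
    if average_pooling and cls_token:
        raise ValueError("choose average_pooling or cls_token, not both")
    if average_pooling:
        return mean_pool(token_vectors, attention_mask)
    if cls_token:
        return cls_pool(token_vectors)
    # Assumption: with neither flag set, return the unpooled real-token vectors.
    return [vec for vec, keep in zip(token_vectors, attention_mask) if keep]


# Three token positions of dimension 2; the last position is padding.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
averaged = pool(vectors, mask, average_pooling=True)
first = pool(vectors, mask, average_pooling=False, cls_token=True)
```

Average pooling yields one vector that summarizes the whole sequence, while CLS extraction relies on the model having learned to aggregate information at the first position. In PyTorch the masked mean is usually written as a masked sum over the token axis divided by the number of unmasked tokens.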
Method Details
- move_to_device(device: str): Sets the device ('cpu', 'mps', or 'cuda') on which the model will run and updates the self.device attribute.
- add_head(head: torch.nn.Module): Adds a task-specific head module to the model, stored in self.head. The head is applied in the forward() method for additional processing.
- compute_representations(sequences: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor]: Generates vector representations for a list of input sequences. Either average_pooling or cls_token can be used to extract representations, but not both at once. The sequences are divided into batches and each batch is processed in turn.
- compute_batch(batch: List[str], average_pooling: bool, cls_token: Optional[bool] = False) -> List[torch.Tensor]: Processes a single batch of sequences, creating representations using either average pooling or CLS token extraction. Returns one tensor representation per sequence in the batch.
- dim() -> int: Returns the dimension of the representation layer based on the loaded model.
- forward(batch, labels=None): Executes a forward pass through the model with the given batch, applying the optional head if present. Returns the output representation tensor.
- max_len() -> int: Returns the maximum sequence length allowed by the model, based on the lab attribute.
- print_trainable_parameters(): Calculates and prints the total number of model parameters, the number of trainable parameters, and the percentage of trainable parameters.
- _load_model(model: str): Internal method that loads the specified pre-trained model and tokenizer from Rostlab, Facebook, or ElnaggarLab. Sets attributes such as dimension, tokenizer, and model.
- _divide_in_batches(sequences: List[str], batch_size: int) -> List[List[str]]: Splits a list of sequences into smaller batches according to the batch size and adjusts sequences to meet model-specific formatting requirements.
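The batching step described for _divide_in_batches could look roughly like the sketch below. The function names and the lab parameter are hypothetical, and the formatting rule is an assumption: the source does not say which adjustment each lab's models receive. The rule shown is the one published for Rostlab's ProtBert and ProtT5 models, which expect residues separated by spaces with the rare amino acids U, Z, O, and B mapped to X. Every other lab value is passed through unchanged here.

```python
import re
from typing import List


def format_sequence(sequence: str, lab: str) -> str:
    """Apply an assumed lab-specific formatting rule to one sequence."""
    if lab == "Rostlab":
        # ProtBert/ProtT5 convention: map rare residues to X, then space-separate.
        return " ".join(re.sub(r"[UZOB]", "X", sequence))
    # Assumption: other labs take the raw sequence string.
    return sequence


def divide_in_batches(sequences: List[str], batch_size: int,
                      lab: str = "facebook") -> List[List[str]]:
    """Format every sequence, then split the list into chunks of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    formatted = [format_sequence(seq, lab) for seq in sequences]
    # The last batch may be shorter than batch_size.
    return [formatted[i:i + batch_size]
            for i in range(0, len(formatted), batch_size)]


rostlab_batches = divide_in_batches(["MKT", "AUB", "GG"], 2, lab="Rostlab")
plain_batches = divide_in_batches(["MKT", "AUB", "GG"], 2)
```

Formatting before splitting keeps each batch ready to hand to the tokenizer, and slicing with a step of batch_size preserves input order so that results can be matched back to the original sequences.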
This class is intended for users who need a flexible way to compute and handle representations from pre-trained models, supporting GPU (CUDA or MPS) or CPU computation.