ARES with Guardrails

Optionally, the target node may include input and output guardrails. These act as filters for prompts sent to the model (input guardrail) and outputs returned by the model (output guardrail).

The input guardrail filters user requests made to the target connector. It attempts to identify and reject jailbreaks before the model is exposed to them.

The output guardrail filters responses from the target connector. It aims to detect and block harmful or complicit outputs resulting from jailbreak attempts.

Both guardrails can be configured using Granite Guardian.

Guardrails must be defined using a Connector. ARES includes support for IBM’s Granite Guardian and Google’s ShieldGemma.

Below is an example configuration using ibm-granite/granite-3.3-8b-instruct as the target model with Granite Guardian as both input and output guardrails:

target:
  huggingface:  # the target connector
    model_config:
      pretrained_model_name_or_path: ibm-granite/granite-3.3-8b-instruct
      torch_dtype: bfloat16
    tokenizer_config:
      padding_side: left
      pretrained_model_name_or_path: ibm-granite/granite-3.3-8b-instruct

input_guardrail:  # the input guardrail
  response: I'm sorry I cannot comply
  granite-guardian-3.0:

output_guardrail:  # the output guardrail
  response: I'm sorry I cannot comply
  granite-guardian-3.0: