Auto assist questionnaire

Auto-fill Questionnaire using Chain of Thought or Few-Shot Examples¶

This notebook showcases the application of few-shot examples in autofilling questionnaires. It utilizes a json file (risk_questionnaire_cot.json) to provide the LLM with example responses for some use-cases.

By leveraging these few-shot examples, we can enable seamless completion of lengthy questionnaires, minimizing manual effort and improving overall efficiency.

In [1]:

Copied!





from risk_atlas_nexus.blocks.inference import (
    RITSInferenceEngine,
    WMLInferenceEngine,
    OllamaInferenceEngine,
    VLLMInferenceEngine,
)
from risk_atlas_nexus.blocks.inference.params import (
    InferenceEngineCredentials,
    RITSInferenceEngineParams,
    WMLInferenceEngineParams,
    OllamaInferenceEngineParams,
    VLLMInferenceEngineParams,
)

from risk_atlas_nexus.data import load_resource
from risk_atlas_nexus.library import RiskAtlasNexus
from risk_atlas_nexus.blocks.inference import (
    RITSInferenceEngine,
    WMLInferenceEngine,
    OllamaInferenceEngine,
    VLLMInferenceEngine,
)
from risk_atlas_nexus.blocks.inference.params import (
    InferenceEngineCredentials,
    RITSInferenceEngineParams,
    WMLInferenceEngineParams,
    OllamaInferenceEngineParams,
    VLLMInferenceEngineParams,
)

from risk_atlas_nexus.data import load_resource
from risk_atlas_nexus.library import RiskAtlasNexus

/Users/seshu/Documents/2025/risk-atlas-nexus/src/risk_atlas_nexus/toolkit/job_utils.py:4: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from tqdm.autonotebook import tqdm

Risk Atlas Nexus uses Large Language Models (LLMs) to infer risks dimensions. Therefore requires access to LLMs to inference or call the model.¶

Available Inference Engines: WML, Ollama, vLLM, RITS. Please follow the Inference APIs guide before going ahead.

Note: RITS is intended solely for internal IBM use and requires TUNNELALL VPN for access.

In [2]:

Copied!





inference_engine = OllamaInferenceEngine(
    model_name_or_path="granite3.2:8b",
    credentials=InferenceEngineCredentials(api_url="http://localhost:11434"),
    parameters=OllamaInferenceEngineParams(
        num_predict=1000, temperature=0, repeat_penalty=1, num_ctx=8192
    ),
)

# inference_engine = WMLInferenceEngine(
#     model_name_or_path="ibm/granite-20b-code-instruct",
#     credentials={
#         "api_key": "WML_API_KEY",
#         "api_url": "WML_API_URL",
#         "project_id": "WML_PROJECT_ID",
#     },
#     parameters=WMLInferenceEngineParams(
#         max_new_tokens=1000, decoding_method="greedy", repetition_penalty=1
#     ),
# )

# inference_engine = VLLMInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials=InferenceEngineCredentials(
#         api_url="VLLM_API_URL", api_key="VLLM_API_KEY"
#     ),
#     parameters=VLLMInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )

# inference_engine = RITSInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials={
#         "api_key": "RITS_API_KEY",
#         "api_url": "RITS_API_URL",
#     },
#     parameters=RITSInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )
inference_engine = OllamaInferenceEngine(
    model_name_or_path="granite3.2:8b",
    credentials=InferenceEngineCredentials(api_url="http://localhost:11434"),
    parameters=OllamaInferenceEngineParams(
        num_predict=1000, temperature=0, repeat_penalty=1, num_ctx=8192
    ),
)

# inference_engine = WMLInferenceEngine(
#     model_name_or_path="ibm/granite-20b-code-instruct",
#     credentials={
#         "api_key": "WML_API_KEY",
#         "api_url": "WML_API_URL",
#         "project_id": "WML_PROJECT_ID",
#     },
#     parameters=WMLInferenceEngineParams(
#         max_new_tokens=1000, decoding_method="greedy", repetition_penalty=1
#     ),
# )

# inference_engine = VLLMInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials=InferenceEngineCredentials(
#         api_url="VLLM_API_URL", api_key="VLLM_API_KEY"
#     ),
#     parameters=VLLMInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )

# inference_engine = RITSInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials={
#         "api_key": "RITS_API_KEY",
#         "api_url": "RITS_API_URL",
#     },
#     parameters=RITSInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )

[2025-09-10 07:48:03:128] - INFO - RiskAtlasNexus - OLLAMA inference engine will execute requests on the server at http://localhost:11434.
[2025-09-10 07:48:03:157] - INFO - RiskAtlasNexus - Created OLLAMA inference engine.

Create an instance of RiskAtlasNexus¶

Note: (Optional) You can specify your own directory in RiskAtlasNexus(base_dir=<PATH>) to utilize custom AI ontologies. If left blank, the system will use the provided AI ontologies.

In [3]:

Copied!

risk_atlas_nexus = RiskAtlasNexus()
risk_atlas_nexus = RiskAtlasNexus()

[2025-09-10 07:48:03:379] - INFO - RiskAtlasNexus - Created RiskAtlasNexus instance. Base_dir: None

Defining Examples for Auto-Assist Functionality¶

The auto-assist feature utilizes few-shot examples defined in the file risk_atlas_nexus/data/templates/risk_questionnaire_cot.json to predict the output of the risk questionnaire.

Customization:

To adapt this auto-assist functionality to custom risk questionnaire, users need to provide their own set of questions, example intents, and corresponding answers in a json file such as in risk_questionnaire_cot.json. This will enable the LLM to learn from these few-shot examples and generate responses for unseen queries.

CoT Template - Zero Shot method

Each question is accompanied by corresponding examples provided as an empty list.

  [
      {
          "question": "In which environment is the system used?",
          "cot_examples": []
      }
      ...
  ]

CoT Template - Few Shot method

Each question is associated with a list of examples, each containing intent, answer, and optional explanation.

  [
      {
          "question": "In which environment is the system used?",
          "cot_examples": [
            {
              "intent": "Find patterns in healthcare insurance claims",
              "answer": "Insurance Claims Processing or Risk Management or Data Analytics",
              "explanation": "The system might be used by an insurance company's claims processing department to analyze and identify patterns in healthcare insurance claims."
            },
            {
                "intent": "optimize supply chain management in Investment banks",
                "answer": "Treasury Departments or Asset Management Divisions or Private Banking Units",
                "explanation": null
            },
            ...
          ]
      }
      ...
  ]

In this notebook, we're using a simplified template to cover 7 questions from the Airo questionnaire:

AI Domain
System environment
Utilized techniques
Intended User
Intended Purpose
System Application
AI Subject

Load Risk Questionnaire¶

Note: The cell below loads examples of risk questionnaires from Risk Atlas Master. To load your custom questionnaire, create it according to the specified format and load it instead.

In [4]:

Copied!

risk_questionnaire = load_resource("risk_questionnaire_cot.json")

risk_questionnaire[0]
risk_questionnaire = load_resource("risk_questionnaire_cot.json")

risk_questionnaire[0]

Out[4]:

{'no': 'Q1',
 'question': 'What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other',
 'cot_examples': [{'intent': 'Optimize supply chain management in Investment banks',
   'answer': 'Strategy',
   'confidence': 'Likely answer from the intent',
   'explanation': 'Since the task is involved in improving the processes to ensure better performance. It is not finance since the task is on supply chain optimization and not on financial aspects even though the application domain is banks.'},
  {'intent': 'Ability to create dialog flows and integrations from natural language instructions.',
   'answer': 'Customer service/support',
   'confidence': 'Likely answer from the intent',
   'explanation': 'Since the task relates to human conversations or generating human converstations or support.'},
  {'intent': 'Check if a document has grammatical mistakes.',
   'answer': 'Writing assitant',
   'confidence': 'Likely answer from the intent',
   'explanation': 'Since this helps in improving the quality of text. It is not customer service since this on on the quality of text rather than helping in human conversations.'},
  {'intent': 'Optimize supply chain management in Investment banks',
   'answer': 'Strategy',
   'confidence': 'Likely answer from the intent',
   'explanation': 'Since the task is involved in improving the processes to ensure better performance. It is not finance since the task is on supply chain optimization and not on financial aspects even though the application domain is banks.'},
  {'intent': "In the context of drug repurposing, generative AI can be employed to analyze vast databases of existing drugs and their clinical trials data. By identifying patterns and similarities, the AI can suggest potential new therapeutic indications for existing drugs, based on the chemical structure and pharmacological properties of the APIs. This process can help streamline the drug development pipeline, as it would reduce the need for time-consuming and expensive clinical trials for new indications. For instance, a drug like Atorvastatin, which is currently used to lower cholesterol, could be repurposed for the treatment of diabetic nephropathy, a kidney disease, based on the AI's analysis of similar drugs and their clinical data. This would not only save resources but also provide new treatment options for patients suffering from this debilitating condition. ",
   'answer': 'Healthcare and strategy',
   'confidence': 'Directly from the  input text',
   'explanation': 'Since the task is related to healthcare and drug repurposing, which involves analyzing data related to drugs and their clinical trials, this falls under the healthcare domain. It also involves Strategy it talks about using patterns to create new treatment options.'}]}

There are two ways to use the inference engine to get the LLM outputs. generate_zero_shot_risk_questionnaire_output which gives the zero-shot output for the question and generate_few_shot_risk_questionnaire_output which gives the output using few-shot examples defined above.

Auto Assist Questionnaire - Zero Shot¶

In [5]:

Copied!





usecase = "Generate personalized, relevant responses, recommendations, and summaries of claims for customers to support agents to enhance their interactions with customers. Include whether the information is in the prompt, a plausible answer or not able to answer based on the prompt."

results = risk_atlas_nexus.generate_zero_shot_risk_questionnaire_output(
    usecase, risk_questionnaire, inference_engine
)

# Display Results
for index, (question_data, result) in enumerate(
    zip(risk_questionnaire, results), start=1
):
    print(
        f"\n{index}: "
        + question_data["question"]
        + "\nA: "
        + result.prediction["answer"]
    )
usecase = "Generate personalized, relevant responses, recommendations, and summaries of claims for customers to support agents to enhance their interactions with customers. Include whether the information is in the prompt, a plausible answer or not able to answer based on the prompt."

results = risk_atlas_nexus.generate_zero_shot_risk_questionnaire_output(
    usecase, risk_questionnaire, inference_engine
)

# Display Results
for index, (question_data, result) in enumerate(
    zip(risk_questionnaire, results), start=1
):
    print(
        f"\n{index}: "
        + question_data["question"]
        + "\nA: "
        + result.prediction["answer"]
    )

Inferring with OLLAMA: 100%|██████████| 7/7 [00:54<00:00,  7.84s/it]

1: What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other
A: Customer service/support

2: In which environment is the system used?
A: The system is used in a digital environment, specifically as an AI assistant providing responses to user queries.

3: What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning
A: The system employs a variety of techniques including multi-modal capabilities such as Document Question/Answering, Image and text to text, Video and text to text, and visual question answering. In the realm of Natural Language Processing, it utilizes methods like feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, and zero shot classification. For computer vision, it incorporates image classification, image segmentation, text to image, and object detection. In the audio domain, it uses audio classification, audio to audio, and text to speech. Additionally, it includes tabular methods like tabular classification and tabular regression. Lastly, it incorporates reinforcement learning techniques.

4: Who is the intended user of the system?
A: The intended user of the system is the customer service agent.

5: What is the intended purpose of the system?
A: The intended purpose of the system is to assist as a compliance officer, providing personalized, relevant responses, recommendations, and summaries of claims to support agents in their interactions with customers. This is inferred from the Intent, which specifies generating personalized content to enhance customer interactions.

6: What is the application of the system?
A: The application of the system is not explicitly provided in the given prompt. Therefore, I cannot generate a personalized, relevant response regarding its application.

7: Who is the subject as per the intent?
A: The subject in this context is the compliance officer role.

In [6]:

Copied!





# for index, (question_data, result) in enumerate(
#     zip(risk_questionnaire, results), start=1
# ):
#     print(
#         f"\n{index}: "
#         + question_data["question"]
#         + "\nA: "
#         + result.prediction["answer"]
#         + "\nC: "
#         + result.prediction["Confidence"]
#     )
# for index, (question_data, result) in enumerate(
#     zip(risk_questionnaire, results), start=1
# ):
#     print(
#         f"\n{index}: "
#         + question_data["question"]
#         + "\nA: "
#         + result.prediction["answer"]
#         + "\nC: "
#         + result.prediction["Confidence"]
#     )

Auto Assist Questionnaire - Few Shot¶

In [7]:

Copied!

import json 

usecases = load_resource("questionnaire_benchmark.json")

predictions = []
import json 

usecases = load_resource("questionnaire_benchmark.json")

predictions = []

In [8]:

Copied!

index_map = {0: "UseCase", 1: "Domain", 2: "Environment", 3: "Techniques_Utilised", 4: "Intended_User", 5: "Purpose", 6: "Application", 7: "Subject" }
index_map = {0: "UseCase", 1: "Domain", 2: "Environment", 3: "Techniques_Utilised", 4: "Intended_User", 5: "Purpose", 6: "Application", 7: "Subject" }

In [9]:

Copied!





# usecase = usecases[0]["UseCase"]
for usecase_dict in usecases[2:3]:
    usecase = usecase_dict["UseCase"] 
    results = risk_atlas_nexus.generate_few_shot_risk_questionnaire_output(
        usecase,
        risk_questionnaire,
        inference_engine,
    )
    
    result_dict = {}
    # Display Results
    for index, (question_data, result) in enumerate(
        zip(risk_questionnaire, results), start=1
    ):
        print(
            f"\n{index}: "
            + question_data["question"]
            + "\nA: "
            + result.prediction["answer"]
            + "\nC: "
            + result.prediction["confidence"]
        )
        print(result)
        result_dict[index_map[index]] = {}
        result_dict[index_map[index]]["answer"] = result.prediction["answer"]
        result_dict[index_map[index]]["explanation"] = result.prediction["explanation"]
        result_dict[index_map[index]]["confidence"] = result.prediction["confidence"]

        result_dict["usecase"] = usecase
        # print(result)
    
    predictions.append(result_dict)
# usecase = usecases[0]["UseCase"]
for usecase_dict in usecases[2:3]:
    usecase = usecase_dict["UseCase"] 
    results = risk_atlas_nexus.generate_few_shot_risk_questionnaire_output(
        usecase,
        risk_questionnaire,
        inference_engine,
    )
    
    result_dict = {}
    # Display Results
    for index, (question_data, result) in enumerate(
        zip(risk_questionnaire, results), start=1
    ):
        print(
            f"\n{index}: "
            + question_data["question"]
            + "\nA: "
            + result.prediction["answer"]
            + "\nC: "
            + result.prediction["confidence"]
        )
        print(result)
        result_dict[index_map[index]] = {}
        result_dict[index_map[index]]["answer"] = result.prediction["answer"]
        result_dict[index_map[index]]["explanation"] = result.prediction["explanation"]
        result_dict[index_map[index]]["confidence"] = result.prediction["confidence"]

        result_dict["usecase"] = usecase
        # print(result)
    
    predictions.append(result_dict)

Inferring with OLLAMA: 100%|██████████| 7/7 [01:31<00:00, 13.09s/it]

1: What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other
A: Marketing
C: Likely answer from the intent
TextGenerationInferenceOutput(prediction={'answer': 'Marketing', 'explanation': 'Since the task involves analyzing social media data to understand public sentiment, which is a key aspect of marketing strategies and customer understanding.', 'confidence': 'Likely answer from the intent'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

2: In which environment is the system used?
A: Social Media Monitoring Teams or Public Relations Departments or Customer Service Teams
C: Likely answer from the intent
TextGenerationInferenceOutput(prediction={'answer': 'Social Media Monitoring Teams or Public Relations Departments or Customer Service Teams', 'explanation': '1. Social Media Monitoring Teams: Companies or organizations with dedicated social media monitoring teams could use this system to analyze public sentiment towards their brand, products, or services. 2. Public Relations Departments: PR departments might utilize this application to track and respond to public sentiment, manage crises, and develop effective communication strategies. 3. Customer Service Teams: Customer service teams could benefit from using this system to understand customer feedback, improve service quality, and address concerns proactively.', 'confidence': 'Likely answer from the intent'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

3: What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning
A: Natural language processing: text classification, sentiment analysis
C: Likely answer from the intent
TextGenerationInferenceOutput(prediction={'answer': 'Natural language processing: text classification, sentiment analysis', 'explanation': 'Sentiment Analysis for Social Media Monitoring involves analyzing text data from social media platforms to understand public opinion or sentiment towards a particular topic, brand, or product. This is achieved through Natural Language Processing techniques, specifically text classification and sentiment analysis. Text classification categorizes the text into predefined sentiment classes (e.g., positive, negative, neutral), while sentiment analysis determines the emotional tone behind the words.', 'confidence': 'Likely answer from the intent'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

4: Who is the intended user of the system?
A: Brands, Marketing Agencies, or Social Media Managers
C: Likely answer from the intent
TextGenerationInferenceOutput(prediction={'answer': 'Brands, Marketing Agencies, or Social Media Managers', 'explanation': 'Sentiment analysis for social media monitoring is primarily used by brands, marketing agencies, or social media managers to understand public opinion about their products, services, or brand image. By analyzing sentiments expressed in social media posts, these users can gain insights into customer satisfaction, brand reputation, and market trends. This information can help them make informed decisions about marketing strategies, product development, and customer service improvements.', 'confidence': 'Likely answer from the intent'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

5: What is the intended purpose of the system?
A: To monitor and analyze public sentiment expressed on social media platforms, providing insights into brand reputation, customer satisfaction, and market trends, enabling proactive management of online presence and crisis response.
C: High. The answer is directly inferred from the intent's focus on sentiment analysis for social media monitoring.
TextGenerationInferenceOutput(prediction={'answer': 'To monitor and analyze public sentiment expressed on social media platforms, providing insights into brand reputation, customer satisfaction, and market trends, enabling proactive management of online presence and crisis response.', 'explanation': "The intent focuses on sentiment analysis for social media monitoring, which implies the system's purpose is to track and interpret the emotions, opinions, and attitudes expressed by users on social media platforms. This analysis helps organizations understand their brand reputation, customer satisfaction, and market trends, allowing them to make informed decisions and respond effectively to potential crises.", 'confidence': "High. The answer is directly inferred from the intent's focus on sentiment analysis for social media monitoring."}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

6: What is the application of the system?
A: Natural Language Processing (NLP): Analyze social media posts and comments to determine public sentiment towards the brand, products, or services. This can help in understanding customer perceptions, identifying potential issues, and informing marketing strategies.
C: High
TextGenerationInferenceOutput(prediction={'answer': 'Natural Language Processing (NLP): Analyze social media posts and comments to determine public sentiment towards the brand, products, or services. This can help in understanding customer perceptions, identifying potential issues, and informing marketing strategies.', 'explanation': 'The intent focuses on Sentiment Analysis for Social Media Monitoring. This involves using NLP to process and interpret the emotional tone behind words in social media posts. The system would analyze the text to categorize it as positive, negative, or neutral, providing insights into public sentiment towards the brand, products, or services. This can be used to understand customer perceptions, identify potential issues, and inform marketing strategies.', 'confidence': 'High'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

7: Who is the subject as per the intent?
A: Social media users
C: Directly from the input text
TextGenerationInferenceOutput(prediction={'answer': 'Social media users', 'explanation': 'The system would need to analyze and interpret the sentiment of posts and comments made by users on social media platforms. This implies that the subject of the AI system is the social media users themselves, as their posts and interactions are the data being analyzed for sentiment.', 'confidence': 'Directly from the input text'}, input_tokens=None, output_tokens=None, stop_reason=None, seed=None, input_text=None, model_name_or_path='granite3.2:8b', inference_engine='OLLAMA')

In [10]:

Copied!

usecases[2] 
usecases[2]

Out[10]:

{'UseCase': 'Sentiment Analysis for Social Media Monitoring',
 'Domain': 'Marketing',
 'Environment': 'Social media platforms and marketing databases',
 'Techniques_Utilised': ['Natural Language Processing: Text Classification, Sentiment Analysis'],
 'Intended_User': 'Marketing Teams and Brand Managers',
 'Purpose': 'To gauge public opinion about products, services, or brands on social media platforms',
 'Application': 'Analyzing user-generated content to understand customer sentiment, track brand mentions, and identify trends',
 'Subject': 'Social media users discussing the monitored topics'}

In [12]:

Copied!

with open("./autoassist_questionnaire_v2.json", "w") as f:
    json.dump(predictions, f)
with open("./autoassist_questionnaire_v2.json", "w") as f:
    json.dump(predictions, f)

In [13]:

Copied!





import json

with open("./autoassist_questionnaire_v2.json") as f:
    autoassist_data = json.load(f)

usecase_autoassist = []
domain_autoassist = []
environment_autoassist = []
techniques_utilised_autoassist = []
intended_user_autoassist = []
purpose_autoassist = []
application_autoassist = []
subject_autoassist = []


for data in autoassist_data:
    domain_autoassist.append(data["Domain"])
    environment_autoassist.append(data["Environment"])
    techniques_utilised_autoassist.append(data["Techniques_Utilised"])
    intended_user_autoassist.append(data["Intended_User"])
    purpose_autoassist.append(data["Purpose"])
    application_autoassist.append(data["Application"])
    subject_autoassist.append(data["Subject"])
    usecase_autoassist.append(data["usecase"])
import json

with open("./autoassist_questionnaire_v2.json") as f:
    autoassist_data = json.load(f)

usecase_autoassist = []
domain_autoassist = []
environment_autoassist = []
techniques_utilised_autoassist = []
intended_user_autoassist = []
purpose_autoassist = []
application_autoassist = []
subject_autoassist = []


for data in autoassist_data:
    domain_autoassist.append(data["Domain"])
    environment_autoassist.append(data["Environment"])
    techniques_utilised_autoassist.append(data["Techniques_Utilised"])
    intended_user_autoassist.append(data["Intended_User"])
    purpose_autoassist.append(data["Purpose"])
    application_autoassist.append(data["Application"])
    subject_autoassist.append(data["Subject"])
    usecase_autoassist.append(data["usecase"])

In [14]:

Copied!





from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)

test_cases = []
question = "In which environment is the system used?"
for index, data in enumerate(environment_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)

test_cases = []
question = "In which environment is the system used?"
for index, data in enumerate(environment_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])

✨ You're running DeepEval's latest Likely relevance [GEval] Metric! (using gemma3n (Ollama), strict=False, 
async_mode=True)...

/Users/seshu/Documents/2025/risk-atlas-nexus/ran/lib/python3.11/site-packages/rich/live.py:231: UserWarning: 
install "ipywidgets" for Jupyter support
  warnings.warn('install "ipywidgets" for Jupyter support')

======================================================================

Metrics Summary

  - ✅ Likely relevance [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by suggesting relevant teams that would benefit from a social media monitoring system. It provides specific examples (Social Media Monitoring Teams, PR Departments, Customer Service Teams) and explains how each team could utilize the system. The explanation is relevant, valid, and provides a clear rationale for each suggestion, demonstrating a strong understanding of the application's potential uses., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.In which environment is the system used?
  - actual output: Answer: Social Media Monitoring Teams or Public Relations Departments or Customer Service Teams.Explanation: 1. Social Media Monitoring Teams: Companies or organizations with dedicated social media monitoring teams could use this system to analyze public sentiment towards their brand, products, or services. 2. Public Relations Departments: PR departments might utilize this application to track and respond to public sentiment, manage crises, and develop effective communication strategies. 3. Customer Service Teams: Customer service teams could benefit from using this system to understand customer feedback, improve service quality, and address concerns proactively.
  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Likely relevance [GEval]: 100.00% pass rate

======================================================================

✓ Tests finished 🎉! Run 'deepeval view' to analyze, debug, and save evaluation results on Confident AI.

Out[14]:

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Likely relevance [GEval]', threshold=0.5, success=True, score=0.5, reason="The response directly addresses the prompt by suggesting relevant teams that would benefit from a social media monitoring system. It provides specific examples (Social Media Monitoring Teams, PR Departments, Customer Service Teams) and explains how each team could utilize the system. The explanation is relevant, valid, and provides a clear rationale for each suggestion, demonstrating a strong understanding of the application's potential uses.", strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nLikely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the actual output is relevant to the input.",\n    "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",\n    "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversational=False, multimodal=False, input='Intent: Sentiment Analysis for Social Media Monitoring.In which environment is the system used?', actual_output='Answer: Social Media Monitoring Teams or Public Relations Departments or Customer Service Teams.Explanation: 1. Social Media Monitoring Teams: Companies or organizations with dedicated social media monitoring teams could use this system to analyze public sentiment towards their brand, products, or services. 2. Public Relations Departments: PR departments might utilize this application to track and respond to public sentiment, manage crises, and develop effective communication strategies. 3. Customer Service Teams: Customer service teams could benefit from using this system to understand customer feedback, improve service quality, and address concerns proactively.', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)

In [15]:

Copied!





from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other"
for index, data in enumerate(domain_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other"
for index, data in enumerate(domain_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])

✨ You're running DeepEval's latest Likely relevance [GEval] Metric! (using gemma3n (Ollama), strict=False, 
async_mode=True)...

======================================================================

Metrics Summary

  - ✅ Likely relevance [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the task of analyzing social media data to understand public sentiment, linking it to marketing strategies and customer understanding. The explanation is relevant and plausible, demonstrating a clear connection between the input (social media sentiment analysis) and the output (marketing). It correctly identifies marketing as a relevant field., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other
  - actual output: Answer: Marketing.Explanation: Since the task involves analyzing social media data to understand public sentiment, which is a key aspect of marketing strategies and customer understanding.
  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Likely relevance [GEval]: 100.00% pass rate

======================================================================

✓ Tests finished 🎉! Run 'deepeval view' to analyze, debug, and save evaluation results on Confident AI.

Out[15]:

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Likely relevance [GEval]', threshold=0.5, success=True, score=0.5, reason='The response directly addresses the task of analyzing social media data to understand public sentiment, linking it to marketing strategies and customer understanding. The explanation is relevant and plausible, demonstrating a clear connection between the input (social media sentiment analysis) and the output (marketing). It correctly identifies marketing as a relevant field.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nLikely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the actual output is relevant to the input.",\n    "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",\n    "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversational=False, multimodal=False, input='Intent: Sentiment Analysis for Social Media Monitoring.What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other', actual_output='Answer: Marketing.Explanation: Since the task involves analyzing social media data to understand public sentiment, which is a key aspect of marketing strategies and customer understanding.', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)

In [16]:

Copied!





from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning"
for index, data in enumerate(techniques_utilised_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning"
for index, data in enumerate(techniques_utilised_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])

✨ You're running DeepEval's latest Likely relevance [GEval] Metric! (using gemma3n (Ollama), strict=False, 
async_mode=True)...

======================================================================

Metrics Summary

  - ✅ Likely relevance [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by identifying natural language processing applications in text classification and sentiment analysis. The explanation is relevant, providing a clear description of sentiment analysis for social media monitoring and how it utilizes text classification and sentiment analysis techniques. It correctly links these concepts to NLP and explains their roles in understanding public opinion., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning
  - actual output: Answer: Natural language processing: text classification, sentiment analysis.Explanation: Sentiment Analysis for Social Media Monitoring involves analyzing text data from social media platforms to understand public opinion or sentiment towards a particular topic, brand, or product. This is achieved through Natural Language Processing techniques, specifically text classification and sentiment analysis. Text classification categorizes the text into predefined sentiment classes (e.g., positive, negative, neutral), while sentiment analysis determines the emotional tone behind the words.
  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Likely relevance [GEval]: 100.00% pass rate

======================================================================

✓ Tests finished 🎉! Run 'deepeval view' to analyze, debug, and save evaluation results on Confident AI.

Out[16]:

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Likely relevance [GEval]', threshold=0.5, success=True, score=0.5, reason='The response directly addresses the prompt by identifying natural language processing applications in text classification and sentiment analysis. The explanation is relevant, providing a clear description of sentiment analysis for social media monitoring and how it utilizes text classification and sentiment analysis techniques. It correctly links these concepts to NLP and explains their roles in understanding public opinion.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nLikely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the actual output is relevant to the input.",\n    "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",\n    "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversational=False, multimodal=False, input='Intent: Sentiment Analysis for Social Media Monitoring.What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning', actual_output='Answer: Natural language processing: text classification, sentiment analysis.Explanation: Sentiment Analysis for Social Media Monitoring involves analyzing text data from social media platforms to understand public opinion or sentiment towards a particular topic, brand, or product. This is achieved through Natural Language Processing techniques, specifically text classification and sentiment analysis. Text classification categorizes the text into predefined sentiment classes (e.g., positive, negative, neutral), while sentiment analysis determines the emotional tone behind the words.', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)

In [19]:

Copied!





from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What is the application of the system?"
for index, data in enumerate(application_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What is the application of the system?"
for index, data in enumerate(application_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])

✨ You're running DeepEval's latest Likely relevance [GEval] Metric! (using gemma3n (Ollama), strict=False, 
async_mode=True)...

======================================================================

Metrics Summary

  - ✅ Likely relevance [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The output directly addresses the prompt by explaining how Natural Language Processing (NLP) can be used for sentiment analysis of social media posts. It accurately describes the process of analyzing text to determine public sentiment, identifying potential issues, and informing marketing strategies. The explanation is relevant, valid, and covers the key aspects of the task., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.What is the application of the system?
  - actual output: Answer: Natural Language Processing (NLP): Analyze social media posts and comments to determine public sentiment towards the brand, products, or services. This can help in understanding customer perceptions, identifying potential issues, and informing marketing strategies..Explanation: The intent focuses on Sentiment Analysis for Social Media Monitoring. This involves using NLP to process and interpret the emotional tone behind words in social media posts. The system would analyze the text to categorize it as positive, negative, or neutral, providing insights into public sentiment towards the brand, products, or services. This can be used to understand customer perceptions, identify potential issues, and inform marketing strategies.
  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Likely relevance [GEval]: 100.00% pass rate

======================================================================

✓ Tests finished 🎉! Run 'deepeval view' to analyze, debug, and save evaluation results on Confident AI.

Out[19]:

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Likely relevance [GEval]', threshold=0.5, success=True, score=0.5, reason='The output directly addresses the prompt by explaining how Natural Language Processing (NLP) can be used for sentiment analysis of social media posts. It accurately describes the process of analyzing text to determine public sentiment, identifying potential issues, and informing marketing strategies. The explanation is relevant, valid, and covers the key aspects of the task.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nLikely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the actual output is relevant to the input.",\n    "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",\n    "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversational=False, multimodal=False, input='Intent: Sentiment Analysis for Social Media Monitoring.What is the application of the system?', actual_output='Answer: Natural Language Processing (NLP): Analyze social media posts and comments to determine public sentiment towards the brand, products, or services. This can help in understanding customer perceptions, identifying potential issues, and informing marketing strategies..Explanation: The intent focuses on Sentiment Analysis for Social Media Monitoring. This involves using NLP to process and interpret the emotional tone behind words in social media posts. The system would analyze the text to categorize it as positive, negative, or neutral, providing insights into public sentiment towards the brand, products, or services. This can be used to understand customer perceptions, identify potential issues, and inform marketing strategies.', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)

In [20]:

Copied!





from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "Who is the subject as per the intent?"
for index, data in enumerate(subject_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Likely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

likely_relevance_metric = GEval(
    name="Likely relevance",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "Who is the subject as per the intent?"
for index, data in enumerate(subject_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[likely_relevance_metric])

✨ You're running DeepEval's latest Likely relevance [GEval] Metric! (using gemma3n (Ollama), strict=False, 
async_mode=True)...

======================================================================

Metrics Summary

  - ✅ Likely relevance [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by identifying social media users as the subject of sentiment analysis. The explanation clearly connects the analysis of posts and comments to the users themselves, demonstrating a strong understanding of the task's core requirement. It's relevant, valid, and a probable correct interpretation., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.Who is the subject as per the intent?
  - actual output: Answer: Social media users.Explanation: The system would need to analyze and interpret the sentiment of posts and comments made by users on social media platforms. This implies that the subject of the AI system is the social media users themselves, as their posts and interactions are the data being analyzed for sentiment.
  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Likely relevance [GEval]: 100.00% pass rate

======================================================================

✓ Tests finished 🎉! Run 'deepeval view' to analyze, debug, and save evaluation results on Confident AI.

Out[20]:

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Likely relevance [GEval]', threshold=0.5, success=True, score=0.5, reason="The response directly addresses the prompt by identifying social media users as the subject of sentiment analysis. The explanation clearly connects the analysis of posts and comments to the users themselves, demonstrating a strong understanding of the task's core requirement. It's relevant, valid, and a probable correct interpretation.", strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nLikely relevance (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and Likely relevance whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the actual output is relevant to the input.",\n    "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",\n    "Assign a score for Likely relevance on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."\n] \n \nRubric:\nNone \n \nScore: 0.5')], conversational=False, multimodal=False, input='Intent: Sentiment Analysis for Social Media Monitoring.Who is the subject as per the intent?', actual_output='Answer: Social media users.Explanation: The system would need to analyze and interpret the sentiment of posts and comments made by users on social media platforms. This implies that the subject of the AI system is the social media users themselves, as their posts and interactions are the data being analyzed for sentiment.', expected_output=None, context=None, retrieval_context=None, additional_metadata=None)], confident_link=None)